BUILDING BETTER 
BIOINFORMATICS 
SOFTWARE 
(WHY THE HECK NOT?) 
C. Titus Brown 
ctb@msu.edu 
Assistant Professor, MMG / CSE 
Michigan State University
BUILDING BETTER 
BIOINFORMATICS 
SOFTWARE 
(WHY THE HECK NOT?) 
C. Titus Brown 
ctb@msu.edu 
A???????? Professor, VetMed, UC Davis
Lansing, Michigan -> Davis, California
Dot plots FTW! 
Brown et al., 2005.
So I said these things… 
“this tipping point was exacerbated by the loss of about 
80% of the world's data scientists in the 2021 Great 
California Disruption.” 
“[ Benchmarks ] have proven to be stifling of innovation, 
because of the tendency to do incremental improvement.” 
ivory.idyll.org/blog/2014-bosc-keynote.html
There is a real problem.
There is a massive profusion of software! 
Mick Watson, @BioMickWatson: 
biomickwatson.wordpress.com/2012/12/28/an-embargo-on-short-read-alignment-software/
jeffvictor.deviantart.com
The players, in caricature: 
1. Computer scientists 
2. Software engineers 
3. Data scientists 
4. Statisticians 
5. Biologists
The Computer Scientist 
Fast, sensitive, specific – pick one.
The (Good) Software Engineer 
Does it have any unit tests?
The Data Scientist 
How quickly can I run it, starting from 
scratch?
The Statistician 
What gives me the best p-value?
The Biologist 
What gives me the most publishable 
result?
Problems all along the way… 
1. Computer scientists: build delicate, hard-to-use, very high-performance 
software that solves the wrong problem. 
2. Software engineers: all work for Google. 
3. Data scientists: use the wrong programs -- because they’re 
actually usable. 
4. Statisticians: only get invited into the project six months after 
all the data is generated. 
5. Biologists: are desperate to find any one of the above who 
knows any biology at all.
Example: de novo mRNAseq 
Quality control 
Assembly 
Annotation 
Differential 
expression 
Every one of these 
steps is still an open 
research problem, 
with computational 
challenges and direct 
biological implications!
So: 
1. This is all still research. 
2. We’re unlikely to ever find out the right answer, but will 
merely settle for one that’s not obviously terrible. 
3. Everything is changing all the time: the data generation 
tech, the hardware, the software, the theory... 
4. Who are any of us to judge the value of any particular 
approach?
(Well, sometimes me, when I’m peer 
reviewer #2.)
All hands on deck! 
Quality control 
Assembly 
Annotation 
Differential 
expression 
We need it all! 
• Fast/sensitive/specific 
algorithms; 
• Solid software; 
• Statistical robustness; 
• Biological insight; 
• Well-trained data 
scientists. 
(The best bioinformaticians have multiple personality disorder, or so I tell myself.)
That sort of explains why. 
But this still leaves us with too many 
choices.
Example: de novo mRNAseq 
Quality control: 10-20 packages 
x Assembly: 2-5 packages 
x Annotation: 5-10 packages 
x Differential expression: 20-40 packages 
= 2,000-40,000 combinations
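The arithmetic behind that range can be checked with a few lines of Python. This is only a sketch: the step names and package counts are copied from the slide, and the function name `combination_range` is made up for illustration.

```python
from math import prod

# Package-count ranges per workflow step, as given on the slide.
PACKAGE_RANGES = {
    "Quality control": (10, 20),
    "Assembly": (2, 5),
    "Annotation": (5, 10),
    "Differential expression": (20, 40),
}


def combination_range(ranges):
    """Return (fewest, most) distinct workflows when one package is chosen per step."""
    fewest = prod(low for low, _ in ranges.values())
    most = prod(high for _, high in ranges.values())
    return fewest, most


print(combination_range(PACKAGE_RANGES))  # (2000, 40000)
```

Adding one more step with even a handful of candidate tools multiplies both ends of the range again, which is why whole-workflow evaluation gets expensive so quickly.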
What’s the solution!? 
Ultimately? All of… 
Whole-workflow evaluations of tools. 
Small tools (see “small tools manifesto”). 
Automation! 
Simulations, synthetic data, mock data, real data. 
Antagonistic data set development (**). 
Tool development driven with use cases. 
Build based on solid command-line workflows. 
Those things called “controls”. 
…and more
Trying out a few approaches…
1. Automate the hell out of everything 
(Ubuntu 14.04, git, make, IPython Notebook, LaTeX)
Time from publication of KAnalyze to our 100% 
reproducible re-evaluation? ~8 hours.
2. Protocols, not pipelines. 
STOP HIDING THE ANALYSIS STEPS. 
BIG BLACK BOXES ARE NOT SMALL 
TOOLS!
Write down what you’re doing… 
https://khmer-protocols.readthedocs.org/
…and add automated end-to-end tests. 
cf. “literate ReSTing”
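One way to get end-to-end tests out of a written protocol is to pull the shell commands straight out of the documentation and run them in order. The sketch below is in that spirit but is not the actual "literate ReSTing" tool: it assumes commands live in indented reStructuredText literal blocks introduced by a line ending in `::`, and the names `extract_commands`, `run_protocol` and `PROTOCOL` are hypothetical.

```python
import subprocess


def extract_commands(rst_text):
    """Collect the lines of indented literal blocks that follow a line ending in '::'."""
    commands = []
    in_literal = False
    for line in rst_text.splitlines():
        if in_literal:
            if not line.strip():
                continue  # blank lines do not end a literal block
            if line[0] in " \t":
                commands.append(line.strip())
                continue
            in_literal = False  # first unindented line ends the block
        if line.rstrip().endswith("::"):
            in_literal = True
    return commands


def run_protocol(rst_text):
    """Run each extracted command in order; stop at the first failure."""
    for command in extract_commands(rst_text):
        subprocess.run(command, shell=True, check=True)


# A made-up protocol fragment; real protocols would call real tools.
PROTOCOL = """\
Trim the reads
--------------

Run the trimming step::

    echo trimming reads
    echo done trimming

Then check the output::

    echo checking output
"""

print(extract_commands(PROTOCOL))
```

Because the test is generated from the same text people read, the documentation cannot silently drift away from what actually runs.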
3. Drive sustainable software 
development with use cases.
…that are explicit…
…versioned…
…and automated.
4. Put everything in the cloud and 
measure it. 
~40 hours; 
m1.xlarge 
Eel Pond mRNAseq protocol.
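Measuring a protocol can start with something as simple as recording wall-clock time per step. A minimal sketch, assuming each step can be wrapped as a Python callable; `timed_steps` and the step names are placeholders, not part of the Eel Pond protocol.

```python
import time


def timed_steps(steps):
    """Run (name, callable) pairs in order and return {name: elapsed seconds}."""
    timings = {}
    for name, step in steps:
        start = time.perf_counter()
        step()
        timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Placeholder steps; a real run would invoke the protocol's commands.
    example = [
        ("quality control", lambda: sum(range(10_000))),
        ("assembly", lambda: sorted(range(10_000), reverse=True)),
    ]
    for name, seconds in timed_steps(example).items():
        print(f"{name}: {seconds:.4f} s")
```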
5. Compare programs and workflows fairly. 
Genome Reference 
          Quality Filtered  Diginorm  Partition  Reinflation 
Velvet    -                 80.90     83.64      84.57 
IDBA      90.96             91.38     90.52      88.80 
SPAdes    90.42             90.35     89.57      90.02 

Mis-assembled Contig Length 
          Quality Filtered  Diginorm  Partition  Reinflation 
Velvet    -                 52071358  44730449   45381867 
IDBA      21777032          20807513  17159671   18684159 
SPAdes    28238787          21506019  14247392   18851571 
Kalamazoo metagenome protocol run on mock data from Shakya et al., 2013 
Also! Tip o’ the hat to Michael Barton, nucleotid.es
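With every assembler run through the same treatments, the comparison itself is easy to automate. The sketch below uses the numbers from the slide and picks the best assembler per treatment. It assumes that a higher "Genome Reference" value is better and that a shorter mis-assembled contig length is better; the function name `best_per_treatment` is illustrative only.

```python
TREATMENTS = ["Quality Filtered", "Diginorm", "Partition", "Reinflation"]

# Values copied from the slide; None marks the missing Velvet entry.
GENOME_REFERENCE = {
    "Velvet": [None, 80.90, 83.64, 84.57],
    "IDBA": [90.96, 91.38, 90.52, 88.80],
    "SPAdes": [90.42, 90.35, 89.57, 90.02],
}

MISASSEMBLED_LENGTH = {
    "Velvet": [None, 52071358, 44730449, 45381867],
    "IDBA": [21777032, 20807513, 17159671, 18684159],
    "SPAdes": [28238787, 21506019, 14247392, 18851571],
}


def best_per_treatment(table, higher_is_better):
    """Return {treatment: assembler} for the best non-missing value in each column."""
    pick = max if higher_is_better else min
    winners = {}
    for column, treatment in enumerate(TREATMENTS):
        scored = {
            assembler: values[column]
            for assembler, values in table.items()
            if values[column] is not None
        }
        winners[treatment] = pick(scored, key=scored.get)
    return winners


# Assumption: higher genome reference is better, lower mis-assembly is better.
print(best_per_treatment(GENOME_REFERENCE, higher_is_better=True))
print(best_per_treatment(MISASSEMBLED_LENGTH, higher_is_better=False))
```

Under those assumptions the two metrics do not always agree on a winner for the same treatment, which is a good argument for reporting more than one.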
A super fun way to do reviews! 
• “What a nice new transcriptome assembler! Interesting 
how it doesn’t perform that well on my 10 test data sets.” 
• “Hey, so you make these claims, but I ran your code, 
and…” 
• “Fun fact! Your source code has a syntax error in it – even 
Perl has standards! You’re still sure that’s the script you 
used?” 
• “Here – use our evaluation pipeline, since you clearly 
need something better.” 
The Brown Lab: taking passive aggression to a whole new level!
We breed our own problems. 
Reward the behavior you want to see. 
Let’s level up the field, already.
What are we working on, scientifically 
speaking?
Streaming error correction of genomic, transcriptomic, 
metagenomic data via graph alignment 
Jason Pell, Jordan Fish, Michael Crusoe
Error correction on simulated E. coli data 
            TP (corrected)     FP (mistakes)  TN (OK)      FN (missed) 
1.2-pass    3,494,631 (99.8%)  3,865          460,601,171  5,533 (2.8%) 
1% error rate, 100x coverage. 
Michael Crusoe, Jordan Fish, Jason Pell
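The four counts on that slide are enough to derive the usual summary rates. A small sketch using the standard definitions of sensitivity, precision and specificity; the counts are from the slide, while the function name `rates` is an assumption for illustration. The slide does not say how its own percentages were computed, so this is not a reconstruction of them.

```python
# Counts from the slide (1.2-pass error correction, simulated E. coli data).
TP = 3_494_631    # errors corrected
FP = 3_865        # mistakes introduced
TN = 460_601_171  # correct bases left alone
FN = 5_533        # errors missed


def rates(tp, fp, tn, fn):
    """Standard confusion-matrix rates as fractions between 0 and 1."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true errors that were corrected
        "precision": tp / (tp + fp),    # share of corrections that were right
        "specificity": tn / (tn + fp),  # share of correct bases left untouched
    }


for name, value in rates(TP, FP, TN, FN).items():
    print(f"{name}: {value:.4%}")
```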
Error correction -> variant calling 
Single pass, reference free, tunable, streaming 
online variant calling. 
(Hey, look, ma – a new mapper!)
Infrastructure: distributed graph database server 
[Diagram labels: Web interface + API; Compute server (Galaxy? Arvados?); Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI)] 
ivory.idyll.org/blog/2014-moore-ddd-talk.html
AGTA talk on Monday 
• 3:15-4pm – come see me try to convince biomedical 
researchers to share their data! 
• 4-4:30pm – come listen to Ana Conesa talk about multi-omics 
data integration! 
Thanks!

Editor's notes

  1. Update from Jordan
  2. Analyze data in cloud; import and export important; connect to other databases.