Why is Bioinformatics (well, really, “genomics”) a Good Fit for Spark?
Timothy Danford
AMPLab
A One-Slide Introduction to Genomics
Bioinformatics computation is batch processing and workflows
● Bioinformatics has a lot of “workflow engines”
○ Galaxy, Taverna, Firehose, Zamboni, Queue, Luigi, bPipe
○ bash scripts
○ even make, fer cryin’ out loud
○ a new one every day
● Bioinformatics software development is still largely a research activity
State-of-the-Art infrastructure: shared filesystems, handwritten parallelism
● Hand-written task creation
● File formats instead of APIs or data models
○ formats are poorly defined
○ contain optional or redundant fields
○ semantics are unclear
● Workflow engines can’t take advantage of common parallelism between stages
So, why Spark?
Most of Genomics is 1-D Geometry
The rest is iterative evaluation of probabilistic models!
Spark RDDs and Partitioners allow declarative parallelization for genomics
● Genomics computation is parallelized in a small, standard number of ways
○ by position
○ by sample
● Declarative, flexible partitioning schemes are useful
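The two partitioning schemes named on this slide can be sketched in plain Python, without Spark. This is only an illustration of the idea, not ADAM's or Spark's partitioner: the contig names, contig lengths, bin width, and all function names below are assumptions made up for the example.

```python
# Hypothetical contig lengths and bin width, for illustration only.
CONTIG_LENGTHS = {"chrA": 1000, "chrB": 500}
BIN_WIDTH = 250  # assumed bin size in base pairs


def build_bin_offsets(contig_lengths, bin_width):
    """Give each contig a starting partition index, in sorted contig order."""
    offsets, next_index = {}, 0
    for contig in sorted(contig_lengths):
        offsets[contig] = next_index
        # Ceiling division: number of bins needed to cover the contig.
        next_index += -(-contig_lengths[contig] // bin_width)
    return offsets, next_index


def partition_by_position(contig, position, offsets, bin_width):
    """Map a 0-based (contig, position) pair to a partition index."""
    return offsets[contig] + position // bin_width


def partition_by_sample(sample_id, sample_ids):
    """Map a sample to a partition index: one partition per sample."""
    return sample_ids.index(sample_id)


offsets, num_partitions = build_bin_offsets(CONTIG_LENGTHS, BIN_WIDTH)
```

In both schemes the caller only declares the key (a genomic position or a sample); which worker handles which partition is left to the framework.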
Spark can easily express genomics primitives: join by genomic overlap
1. Calculate disjoint regions based on left (blue) set
2. Partition both sets by disjoint regions
3. Merge-join within each partition
4. (Optional) aggregation across joined pairs
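The four steps above can be sketched on a single machine in plain Python. The sketch makes several assumptions that are not in the slides: intervals are half-open (start, end) pairs on one contig, the function names are invented for the example, and a plain nested comparison inside each partition stands in for the merge-join of step 3.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def disjoint_regions(left):
    """Step 1: merge overlapping left intervals into sorted, disjoint regions."""
    regions = []
    for start, end in sorted(left):
        if regions and start < regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


def partition(intervals, regions):
    """Step 2: group intervals by the index of every region they overlap."""
    parts = defaultdict(list)
    for interval in intervals:
        for index, region in enumerate(regions):
            if overlaps(interval, region):
                parts[index].append(interval)
    return parts


def overlap_join(left, right):
    """Steps 1-3: return the sorted (left, right) pairs that overlap."""
    regions = disjoint_regions(left)
    left_parts = partition(left, regions)
    right_parts = partition(right, regions)
    pairs = []
    for index, left_items in left_parts.items():
        # Nested comparison within one partition, standing in for a merge-join.
        for l in left_items:
            for r in right_parts.get(index, []):
                if overlaps(l, r):
                    pairs.append((l, r))
    return sorted(pairs)


def count_per_left(pairs):
    """Step 4 (optional): count the joined right intervals per left interval."""
    counts = defaultdict(int)
    for l, _ in pairs:
        counts[l] += 1
    return dict(counts)


left = [(0, 10), (5, 15), (20, 30)]
right = [(8, 12), (14, 22), (40, 50)]
pairs = overlap_join(left, right)
```

Because the regions are built from the left set, each left interval falls in exactly one partition, so no joined pair is produced twice even when a right interval spans several regions.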
ADAM is Genomics + Spark
● A rewrite of core bioinformatics tools and algorithms in Spark
● Combines three technologies
○ Spark
○ Parquet
○ Avro
● Apache 2-licensed
● Started at the AMPLab
http://bdgenomics.org/
Avro and Parquet are just as critical to ADAM as Spark
● Avro to define data models
● Parquet for serialization format
● Still need to answer design questions
○ how wide are the schemas?
○ how much do we follow existing formats?
○ how do we carry through projections?
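To make "data model" and "projection" concrete, here is a minimal sketch. Avro schemas are JSON documents, so one can be written as a Python dict. The record below is hypothetical and is not ADAM's actual schema, and the `project` helper is an invented illustration of selecting a subset of fields, which a columnar format such as Parquet can exploit by reading only those columns.

```python
import json

# Hypothetical Avro-style record schema, for illustration only.
READ_SCHEMA = {
    "type": "record",
    "name": "ExampleRead",
    "fields": [
        {"name": "contig", "type": "string"},
        {"name": "start", "type": "long"},
        {"name": "end", "type": "long"},
        {"name": "sequence", "type": ["null", "string"], "default": None},
        {"name": "mappingQuality", "type": ["null", "int"], "default": None},
    ],
}


def project(schema, field_names):
    """Return a copy of the schema restricted to the named fields."""
    wanted = set(field_names)
    known = {field["name"] for field in schema["fields"]}
    unknown = wanted - known
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    kept = [field for field in schema["fields"] if field["name"] in wanted]
    return {**schema, "fields": kept}


# A position-only projection: enough for overlap queries, without sequences.
position_only = project(READ_SCHEMA, ["contig", "start", "end"])
schema_json = json.dumps(position_only)
```

The schema-width question on the slide shows up directly here: a wider record gives projections more to skip, but every extra field is one more thing each tool has to agree on.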
Still need to convince bioinformaticians to rewrite their software!
● A single piece of a single filtering stage for a somatic variant caller
● “11-base-pair window centered on a candidate mutation” actually turns out to be optimized for a particular file format and sort order
Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
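For contrast, the window rule quoted above can be stated without reference to any file format or sort order. The sketch below is not the code from the cited paper. It assumes reads are half-open (start, end) pairs on one contig, and the function names are invented for the example.

```python
def window_around(position, width=11):
    """Half-open window of `width` bases centered on `position`."""
    half = width // 2
    return (position - half, position + half + 1)


def reads_in_window(reads, position, width=11):
    """Keep the reads that overlap the window, in whatever order they arrive."""
    start, end = window_around(position, width)
    return [read for read in reads if read[0] < end and start < read[1]]


# Input order does not matter: nothing here assumes sorted reads.
reads = [(106, 130), (90, 96), (105, 120), (90, 95)]
selected = reads_in_window(reads, position=100)
```

Written this way, the filter is a predicate over records that an engine is free to parallelize, rather than a loop tied to how one file happens to be laid out.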
The Future: Distributed and Incremental?
● Today: 5k samples × 20 Gb / sample
● Tomorrow: 1M+ samples @ 200+ Gb / sample?
● More and more analysis is aggregative
○ joint variant calling
○ panels of normal samples
○ collective variant annotation
● And “data collection” will never be finished
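The gap between the "today" and "tomorrow" figures is easy to work out. The snippet keeps the slide's "Gb" unit as written and treats the "tomorrow" numbers as lower bounds, since the slide marks them with "+" and a question mark. The function and variable names are invented for the example.

```python
def total_gb(samples, gb_per_sample):
    """Total data volume, in the slide's 'Gb' unit."""
    return samples * gb_per_sample


today = total_gb(5_000, 20)            # 100,000 Gb, i.e. 100 Tb
tomorrow = total_gb(1_000_000, 200)    # 200,000,000 Gb, i.e. 200 Pb
growth = tomorrow // today             # at least a 2,000-fold increase
```

A 200-fold increase in samples combined with a 10-fold increase per sample gives the 2,000-fold total.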
Acknowledgements 
Matt Massie (AMPLab) 
Frank Nothaft (AMPLab) 
Carl Yeksigian (DataStax) 
Anthony Philippakis (Broad Institute) 
Jeff Hammerbacher (Cloudera / Mt. Sinai) 
Thank you! 
(questions?)
