SlideShare una empresa de Scribd logo
1 de 36
Dr. Denis Bauer & Lynn Langit
Genomic-scale Data Pipelines
Denis Bauer,
PhD
Oscar Luo,
PhD
Rob Dunne,
PhD
Piotr Szul
Team
Aidan O’BrienLaurence Wilson,
PhD
Adrian White
Andy Hindmarch
Collaborators
David Levy
News
Software
Dan Andrews
Kaitao Lai,
PhD
Arash Bayat
John Hildebrandt
Mia Chapman
Ian Blair
Kelly Williams
Jules Damji
Gaetan Burgio Lynn Langit
Natalie Twine,
PhD
Prabha Pillay
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Transformational Bioinformatics Team
1000
17
2000
0 500 1000 1500 2000 2500
Astronomy
Twitter
YouTube
Big Data in 2025…Petabytes?
1000
17
2000
0 500 1000 1500 2000 2500
Astronomy
Twitter
YouTube
Big Data in 2025…Petabytes?
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
1
0.17
2
20
0 5 10 15 20 25
Astronomy
Twitter
YouTube
Genomic
GENOMIC Big Data in 2025 - Exabytes
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Genome holds Blueprint for Every Cell
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Affects Looks, Disease Risk, and Behavior
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
VCF Data
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Genomic Research Workflow
https://www.projectmine.com/about/
BigData Focus
Finding the Disease Gene(s)
Spot the letter that is…
• common amongst all affected
• absent in all unaffected*
* oversimplified
cases
controls
Gene1 Gene2
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
Cited
4
Transformational Bioinformatics| Denis C. Bauer @allPowerde
Why
Apache
Spark?
Transformational Bioinformatics| Denis C. Bauer @allPowerde
Performance – Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
low Accuracy high
lowSpeedhigh
CloudDataPipelinePattern
Business
Problem
Data
Quality
Candidate
Technologies
Build/Test
MVPs
Assemble
Pipeline
Transformational Bioinformatics| Denis C. Bauer @allPowerde
Building a CloudDataPipeline
Candidate
Technologies
• Ingest/Clean
• Analyze/Predict
• Visualize
Build MVPs
• Test
• Iterate
• Learn
Assemble
Pipeline
• Combine pieces
• Validate sections
• Test at scale
Transformational Bioinformatics| Denis C. Bauer @allPowerde
Building a Cloud Data Pipeline
Spark
•IaaS, PaaS, SaaS Vendors
•AWS, Azure, GCP…
Transformational Bioinformatics| Denis C. Bauer @allPowerde
Visualizing Machine Learning Results
Transformational Bioinformatics| Denis C. Bauer @allPowerde
Solving Important Questions…
Cancer genomics?
Transformational Bioinformatics| Denis C. Bauer @allPowerde
DEMO: Who is a Bondi Hipster?
Transformational Bioinformatics| Denis C. Bauer @allPowerde
Supervised ML: Wide Random Forests
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Scaling to 50 M variables and 10 K samples
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
100K trees: 5 – 50h
AWS: ~$215.50
100K trees: 200 – 2000h
AWS: ~ $ 8620.00
• Yarn Cluster
• 12 workers
• 16 x Intel CPUs
• Xeon E5-2660@2.20GHz
• 128 GB RAM
• Spark 1.6.1
• 128 executors
• 6GB / executor 0.75TB
• Synthetic dataset
Whole Genome
Range
GWAS Range
Future Directions for VariantSpark RF
Mixed feature types
Unordered
Categorical
Continuous
Build Community
Python API
Non-Genomic
Demos
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Implementation by
Try it out: VariantSpark Notebook
Transformational Bioinformatics| Denis C. Bauer @allPowerde
https://docs.databricks.com/spark/latest/training/variant-spark.html
Genome Editing can correct genetic
diseases, ex. hypertrophic cardiomyopathy
“Editing does not work every time, e.g.
only 7 in 10 embryos were mutation free.”
Aim: Develop computational
guidance framework to
enable edits the first time;
every time
Ma et al. Nature 2017 *
* Controversy around the paper – stay tuned
Transformational Bioinformatics| Denis C. Bauer @allPowerde
Make Process Parallel and Scalable
SPEED
• Each search can be
broken down into parallel
tasks - each takes
seconds
SCALE
• Researchers might want
to search the target for
one gene or 100,000
Scalability + Agility =
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
One of the first Serverless Applications in Research
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Featured in
X-Ray Tracing Demo of GT-Scan2
• Find performance
bottlenecks
• Fix and test
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Webapp
Resources (S3, DynamoDB)
Lambda
25
50
75
getFastaSequence
createJob
targetScan
offtargetScanStarter
offtargetSearch
targetIntersects
targetTranscriptionIntersects
targetW
uScorer
targetSgR
N
AScorer
O
nTargetScorer
genom
eC
R
ISPR
functions
runtime(s)
Type
base
old
GTScan2 X-Ray Analysis
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Results – 4x Faster (80% improvement)
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
2 min
30 sec
Considering Services
for GT-Scan2
• Use AWS Step Functions
• Simplify workflow
• Simplify task timeouts
• Simplify task failures
• Must evaluate costs
• SNS vs. Step Functions
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
CloudDataPipelinePattern
Problem Data Technologies MVPs Pipeline
Search
GTScan2
fastq, bed-> S3, NoSQL Ingest
ETL, Analyze
Viz
S3
Lambda
Lambda/API Gateway
Serverless
Transformational Bioinformatics| Denis C. Bauer @allPowerde
Serverless Pipeline Pattern
Lambda
function
1
Lambda
function
2
Lambda
function
3
buckets with
objects DynamoDB
API Gateway Users
Step Functions
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
CloudDataPipelinePattern
Problem Data Technologies MVPs Pipeline
Analyze
GWAS
vcf -> S3/Spark Ingest
ETL
Analyze
Viz
S3 -> Databricks DBFS
Apache Spark
Variant-Spark ML
Notebook, SQL, R, Python
Spark
Server
Cluster
Transformational Bioinformatics| Denis C. Bauer @allPowerde
Spark Server Cluster Pipeline Pattern
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Jupyter Notebook
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Cloud Genomic-Scale Data Pipelines
• Problem # 1 – ML on Large Data
• Solution: Spark-server cluster + custom
machine learning
• Problem #2 – Burstable Search
• Solution: Serverless pipeline
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Genomic-scale Data Pipelines
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Dr. Denis Bauer & Lynn Langit

Más contenido relacionado

La actualidad más candente

The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationIan Foster
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light SourcesIan Foster
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Data Con LA
 
Democratizing Machine Learning: Perspective from a scikit-learn Creator
Democratizing Machine Learning: Perspective from a scikit-learn CreatorDemocratizing Machine Learning: Perspective from a scikit-learn Creator
Democratizing Machine Learning: Perspective from a scikit-learn CreatorDatabricks
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the ContinuumIan Foster
 
(BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On...
(BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On...(BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On...
(BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On...Amazon Web Services
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduriRavi Madduri
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceinside-BigData.com
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...Denis C. Bauer
 
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...Butler - a framework for a large-scale scientific analysis on the cloud - EOS...
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...ATMOSPHERE .
 
Spark for Behavioral Analytics Research: Spark Summit East talk by John W u
Spark for Behavioral Analytics Research: Spark Summit East talk by John W uSpark for Behavioral Analytics Research: Spark Summit East talk by John W u
Spark for Behavioral Analytics Research: Spark Summit East talk by John W uSpark Summit
 
Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystemSlideCentral
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009Ian Foster
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudAmazon Web Services
 
Sgg crest-presentation-final
Sgg crest-presentation-finalSgg crest-presentation-final
Sgg crest-presentation-finalmarpierc
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneIan Foster
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009Ian Foster
 
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseR + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseAllen Day, PhD
 

La actualidad más candente (19)

The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
 
Democratizing Machine Learning: Perspective from a scikit-learn Creator
Democratizing Machine Learning: Perspective from a scikit-learn CreatorDemocratizing Machine Learning: Perspective from a scikit-learn Creator
Democratizing Machine Learning: Perspective from a scikit-learn Creator
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
 
(BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On...
(BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On...(BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On...
(BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On...
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduri
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...
 
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...Butler - a framework for a large-scale scientific analysis on the cloud - EOS...
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...
 
Spark for Behavioral Analytics Research: Spark Summit East talk by John W u
Spark for Behavioral Analytics Research: Spark Summit East talk by John W uSpark for Behavioral Analytics Research: Spark Summit East talk by John W u
Spark for Behavioral Analytics Research: Spark Summit East talk by John W u
 
Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystem
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the Cloud
 
Sgg crest-presentation-final
Sgg crest-presentation-finalSgg crest-presentation-final
Sgg crest-presentation-final
 
Rasdaman use case
Rasdaman use case Rasdaman use case
Rasdaman use case
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundane
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseR + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
 

Similar a Genome-scale Big Data Pipelines

How novel compute technology transforms life science research
How novel compute technology transforms life science researchHow novel compute technology transforms life science research
How novel compute technology transforms life science researchDenis C. Bauer
 
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...Customer Case Study: How Novel Compute Technology Transforms Medical and Life...
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...Amazon Web Services
 
VariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn LangitVariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn LangitData Con LA
 
VariantSpark on AWS
VariantSpark on AWSVariantSpark on AWS
VariantSpark on AWSLynn Langit
 
Translating genomics into clinical practice - 2018 AWS summit keynote
Translating genomics into clinical practice - 2018 AWS summit keynoteTranslating genomics into clinical practice - 2018 AWS summit keynote
Translating genomics into clinical practice - 2018 AWS summit keynoteDenis C. Bauer
 
Cloud-native machine learning - Transforming bioinformatics research
Cloud-native machine learning - Transforming bioinformatics research Cloud-native machine learning - Transforming bioinformatics research
Cloud-native machine learning - Transforming bioinformatics research Denis C. Bauer
 
SaaS and the Transformation of Research
SaaS and the Transformation of ResearchSaaS and the Transformation of Research
SaaS and the Transformation of ResearchVas Vasiliadis
 
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfOSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfAltinity Ltd
 
Easygenomics ISCB Cloud section 2012
Easygenomics ISCB Cloud section 2012Easygenomics ISCB Cloud section 2012
Easygenomics ISCB Cloud section 2012Xing Xu
 
Next Generation Manufacturing
Next Generation ManufacturingNext Generation Manufacturing
Next Generation ManufacturingElliot Duff
 
wolstencroft-ogf20-astro
wolstencroft-ogf20-astrowolstencroft-ogf20-astro
wolstencroft-ogf20-astrowebuploader
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 
TraitCapture: NextGen Monitoring and Visualization from seed to ecosystem
TraitCapture: NextGen Monitoring and Visualization from seed to ecosystemTraitCapture: NextGen Monitoring and Visualization from seed to ecosystem
TraitCapture: NextGen Monitoring and Visualization from seed to ecosystemTimeScience
 
2016 05 sanger
2016 05 sanger2016 05 sanger
2016 05 sangerChris Dwan
 
Data Culture Series - Keynote - 3rd Dec
Data Culture Series - Keynote - 3rd DecData Culture Series - Keynote - 3rd Dec
Data Culture Series - Keynote - 3rd DecJonathan Woodward
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011Ian Foster
 
Cloud Accelerated Genomics
Cloud Accelerated GenomicsCloud Accelerated Genomics
Cloud Accelerated GenomicsIdan Tohami
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / PhoenixAllen Day, PhD
 
Cognitive Engine: Boosting Scientific Discovery
Cognitive Engine:  Boosting Scientific DiscoveryCognitive Engine:  Boosting Scientific Discovery
Cognitive Engine: Boosting Scientific Discoverydiannepatricia
 

Similar a Genome-scale Big Data Pipelines (20)

How novel compute technology transforms life science research
How novel compute technology transforms life science researchHow novel compute technology transforms life science research
How novel compute technology transforms life science research
 
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...Customer Case Study: How Novel Compute Technology Transforms Medical and Life...
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...
 
VariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn LangitVariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn Langit
 
VariantSpark on AWS
VariantSpark on AWSVariantSpark on AWS
VariantSpark on AWS
 
Translating genomics into clinical practice - 2018 AWS summit keynote
Translating genomics into clinical practice - 2018 AWS summit keynoteTranslating genomics into clinical practice - 2018 AWS summit keynote
Translating genomics into clinical practice - 2018 AWS summit keynote
 
Cloud-native machine learning - Transforming bioinformatics research
Cloud-native machine learning - Transforming bioinformatics research Cloud-native machine learning - Transforming bioinformatics research
Cloud-native machine learning - Transforming bioinformatics research
 
SaaS and the Transformation of Research
SaaS and the Transformation of ResearchSaaS and the Transformation of Research
SaaS and the Transformation of Research
 
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfOSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
 
Easygenomics ISCB Cloud section 2012
Easygenomics ISCB Cloud section 2012Easygenomics ISCB Cloud section 2012
Easygenomics ISCB Cloud section 2012
 
Next Generation Manufacturing
Next Generation ManufacturingNext Generation Manufacturing
Next Generation Manufacturing
 
wolstencroft-ogf20-astro
wolstencroft-ogf20-astrowolstencroft-ogf20-astro
wolstencroft-ogf20-astro
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
TraitCapture: NextGen Monitoring and Visualization from seed to ecosystem
TraitCapture: NextGen Monitoring and Visualization from seed to ecosystemTraitCapture: NextGen Monitoring and Visualization from seed to ecosystem
TraitCapture: NextGen Monitoring and Visualization from seed to ecosystem
 
2016 05 sanger
2016 05 sanger2016 05 sanger
2016 05 sanger
 
Data Culture Series - Keynote - 3rd Dec
Data Culture Series - Keynote - 3rd DecData Culture Series - Keynote - 3rd Dec
Data Culture Series - Keynote - 3rd Dec
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
 
Cloud Accelerated Genomics
Cloud Accelerated GenomicsCloud Accelerated Genomics
Cloud Accelerated Genomics
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
 
iMicrobe_ASLO_2015
iMicrobe_ASLO_2015iMicrobe_ASLO_2015
iMicrobe_ASLO_2015
 
Cognitive Engine: Boosting Scientific Discovery
Cognitive Engine:  Boosting Scientific DiscoveryCognitive Engine:  Boosting Scientific Discovery
Cognitive Engine: Boosting Scientific Discovery
 

Más de Lynn Langit

Serverless Architectures
Serverless ArchitecturesServerless Architectures
Serverless ArchitecturesLynn Langit
 
10+ Years of Teaching Kids Programming
10+ Years of Teaching Kids Programming10+ Years of Teaching Kids Programming
10+ Years of Teaching Kids ProgrammingLynn Langit
 
Testing in Ballerina Language
Testing in Ballerina LanguageTesting in Ballerina Language
Testing in Ballerina LanguageLynn Langit
 
Teaching Kids to create Alexa Skills
Teaching Kids to create Alexa SkillsTeaching Kids to create Alexa Skills
Teaching Kids to create Alexa SkillsLynn Langit
 
Teaching Kids Programming
Teaching Kids ProgrammingTeaching Kids Programming
Teaching Kids ProgrammingLynn Langit
 
Serverless Reality
Serverless RealityServerless Reality
Serverless RealityLynn Langit
 
Serverless Reality
Serverless RealityServerless Reality
Serverless RealityLynn Langit
 
Beyond Relational
Beyond RelationalBeyond Relational
Beyond RelationalLynn Langit
 
New AWS Services for Bioinformatics
New AWS Services for BioinformaticsNew AWS Services for Bioinformatics
New AWS Services for BioinformaticsLynn Langit
 
Google Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsGoogle Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsLynn Langit
 
Scaling Galaxy on Google Cloud Platform
Scaling Galaxy on Google Cloud PlatformScaling Galaxy on Google Cloud Platform
Scaling Galaxy on Google Cloud PlatformLynn Langit
 
SQL Server on Google Cloud Platform
SQL Server on Google Cloud PlatformSQL Server on Google Cloud Platform
SQL Server on Google Cloud PlatformLynn Langit
 
Redis Labs and SQL Server
Redis Labs and SQL ServerRedis Labs and SQL Server
Redis Labs and SQL ServerLynn Langit
 
Building a data warehouse with AWS Redshift, Matillion and Yellowfin
Building a data warehouse with AWS Redshift, Matillion and YellowfinBuilding a data warehouse with AWS Redshift, Matillion and Yellowfin
Building a data warehouse with AWS Redshift, Matillion and YellowfinLynn Langit
 
What is 'Teaching Kids Programming'
What is 'Teaching Kids Programming'What is 'Teaching Kids Programming'
What is 'Teaching Kids Programming'Lynn Langit
 
Teaching Kids Programming for Developers
Teaching Kids Programming for DevelopersTeaching Kids Programming for Developers
Teaching Kids Programming for DevelopersLynn Langit
 
Cloud Big Data Architectures
Cloud Big Data ArchitecturesCloud Big Data Architectures
Cloud Big Data ArchitecturesLynn Langit
 
Cloud-centric Internet of Things
Cloud-centric Internet of ThingsCloud-centric Internet of Things
Cloud-centric Internet of ThingsLynn Langit
 

Más de Lynn Langit (20)

Serverless Architectures
Serverless ArchitecturesServerless Architectures
Serverless Architectures
 
10+ Years of Teaching Kids Programming
10+ Years of Teaching Kids Programming10+ Years of Teaching Kids Programming
10+ Years of Teaching Kids Programming
 
Testing in Ballerina Language
Testing in Ballerina LanguageTesting in Ballerina Language
Testing in Ballerina Language
 
Teaching Kids to create Alexa Skills
Teaching Kids to create Alexa SkillsTeaching Kids to create Alexa Skills
Teaching Kids to create Alexa Skills
 
Practical cloud
Practical cloudPractical cloud
Practical cloud
 
Teaching Kids Programming
Teaching Kids ProgrammingTeaching Kids Programming
Teaching Kids Programming
 
Practical Cloud
Practical CloudPractical Cloud
Practical Cloud
 
Serverless Reality
Serverless RealityServerless Reality
Serverless Reality
 
Serverless Reality
Serverless RealityServerless Reality
Serverless Reality
 
Beyond Relational
Beyond RelationalBeyond Relational
Beyond Relational
 
New AWS Services for Bioinformatics
New AWS Services for BioinformaticsNew AWS Services for Bioinformatics
New AWS Services for Bioinformatics
 
Google Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsGoogle Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline Patterns
 
Scaling Galaxy on Google Cloud Platform
Scaling Galaxy on Google Cloud PlatformScaling Galaxy on Google Cloud Platform
Scaling Galaxy on Google Cloud Platform
 
SQL Server on Google Cloud Platform
SQL Server on Google Cloud PlatformSQL Server on Google Cloud Platform
SQL Server on Google Cloud Platform
 
Redis Labs and SQL Server
Redis Labs and SQL ServerRedis Labs and SQL Server
Redis Labs and SQL Server
 
Building a data warehouse with AWS Redshift, Matillion and Yellowfin
Building a data warehouse with AWS Redshift, Matillion and YellowfinBuilding a data warehouse with AWS Redshift, Matillion and Yellowfin
Building a data warehouse with AWS Redshift, Matillion and Yellowfin
 
What is 'Teaching Kids Programming'
What is 'Teaching Kids Programming'What is 'Teaching Kids Programming'
What is 'Teaching Kids Programming'
 
Teaching Kids Programming for Developers
Teaching Kids Programming for DevelopersTeaching Kids Programming for Developers
Teaching Kids Programming for Developers
 
Cloud Big Data Architectures
Cloud Big Data ArchitecturesCloud Big Data Architectures
Cloud Big Data Architectures
 
Cloud-centric Internet of Things
Cloud-centric Internet of ThingsCloud-centric Internet of Things
Cloud-centric Internet of Things
 

Último

FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxMohamedFarag457087
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curveAreesha Ahmad
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIADr. TATHAGAT KHOBRAGADE
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxseri bangash
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspectsmuralinath2
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....muralinath2
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfrohankumarsinghrore1
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Silpa
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptxryanrooker
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flyPRADYUMMAURYA1
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.Silpa
 

Último (20)

FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curve
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 

Genome-scale Big Data Pipelines

Notas del editor

  1. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  2. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  3. https://www.genome.gov/18016863/a-brief-guide-to-genomics/ https://www.thinglink.com/scene/617714375666434050
  4. http://images.wisegeek.com/woman-in-greek-tank-top-looking-at-thumb.jpg http://nborganics.com.au/index.php/product/herbs-coriander/
  5. http://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40/
  6. Using this instead?
  7. Denis: Do we need this ?
  8. https://academics.cloud.databricks.com/#notebook/170398/command/170419 – AND-- http://www.drjasonfox.com/
  9. https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
  10. http://www.nature.com/nature/journal/v462/n7276/fig_tab/nature08645_F1.html Bauer et al. Trends Mol Med. 2014 PMID: 24801560.
  11. https://www.gt-scan.net/ --AND- AMA with Dr, Bauer -- https://www.reddit.com/r/science/comments/5fiicm/science_ama_series_im_denis_bauer_a_team_leader/
  12. https://aws.amazon.com/xray/details/
  13. X-ray has helped us isolate the component in our GT-Scan2 pipeline that slows the overall execution time, and how replacing it with a more performant function increased runtime 4 fold (125s to 32s).
  14. X-ray has helped us isolate the component in our GT-Scan2 pipeline that slows the overall execution time, and how replacing it with a more performant function increased runtime 4 fold (125s to 32s).
  15. Recent team presentation - https://www.slideshare.net/AustralianNationalDataService/gtscan2-bringing-bioinformatics-to-the-cloud-may-tech-talk
  16. Quickly access a managed Spark cluster - AWS EC2 / spot instances Link to your data and perform whole genome analysis in real-time