Everything Comes in 3’s
Angel Pizarro, Director, ITMAT Bioinformatics Facility, University of Pennsylvania School of Medicine
Outline
  • This talk looks at the practical aspects of cloud computing
  • We will be diving into specific examples:
      - 3 pillars of systems design
      - 3 storage implementations
      - 3 areas of bioinformatics, and how they are affected by clouds
      - 3 interesting internal projects
  • There are 2 hard problems in computer science: caching, naming, and off-by-1 errors
Pillars of Systems Design
  • Provisioning: API access (AWS, Microsoft, RackSpace, GoGrid, etc.). Not discussing further, since this is the WHOLE POINT of cloud computing.
  • Configuration: how to get a system up to the point where you can do something with it
  • Command and Control: how to tell the system what to do
System Configuration with Chef
  • Automatic installation of packages, service configuration and initialization
  • Specifications use a real programming language with known behavior
  • Brings the system to the desired state idempotently: re-running a specification changes nothing once that state is reached
  • http://opscode.com/chef/
  • Image: http://hotclub.files.wordpress.com/2009/09/swedish_chef_bork-sleeper-cell.jpg
Chef Recipes & Cookbooks
  • The specification for installing and configuring a system component
  • Able to support more than one platform
  • Have access to system-wide information: hostname, IP address, RAM, number of processors, etc.
  • Contain templates, documentation, static files & assets
  • Can define dependencies on other recipes
  • Executed in order; execution stops at the first failure
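
The two properties above, in-order execution that stops at the first failure and idempotent convergence, can be sketched in a few lines. This is a Python stand-in for the idea, not Chef's Ruby DSL; `ensure_line`, `converge`, and the resource names are hypothetical and exist only for illustration.

```python
def ensure_line(line):
    """Build an idempotent resource: append `line` to the state only if it is missing."""
    def apply(state):
        if line in state:
            return False      # already converged, nothing to do
        state.append(line)
        return True           # the system was changed
    return apply


def converge(state, resources):
    """Apply (name, resource) pairs in order and return the names that changed state.

    Any exception raised by a resource propagates, so the run stops at the
    first failure and later resources are never applied.
    """
    changed = []
    for name, apply in resources:
        if apply(state):
            changed.append(name)
    return changed


# Hypothetical run list: two lines that must be present in a config file.
run_list = [
    ("rsync-enabled", ensure_line("RSYNC_ENABLE=true")),
    ("rsync-nice", ensure_line("RSYNC_NICE=10")),
]

config = []
first_run = converge(config, run_list)    # both resources change the state
second_run = converge(config, run_list)   # nothing left to change
```

Running the same list twice is the point of the sketch: the first pass reports both resources as changed, the second reports none.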
Simple Recipe: Rsync
  • Installs rsync on the system
  • The metadata file states which platforms are supported
  • Note that Chef is a Linux-centric system
  • BUT, the WikiWiki is MessyMessy: look at Chef Solo & Resources
More Complex Recipe: Heartbeat
  • Installs the heartbeat package
  • Registers the service and specifies that it can be restarted and provides a status message
  • Finally, it starts the service
Command and Control
  • Traditional grid computing
      - QSUB: SGE, PBS, Torque
      - Usually requires tightly coupled and static systems: shared file systems, firewalls, user accounts, shared exe & lib locations
      - Best for capability processes (e.g. MPI)
  • Map-Reduce is the new hotness
      - Best for data-parallel processes
      - Assumes loosely coupled, non-static components
      - Job staging is a critical component
Map Reduce in a Nutshell
  • Algorithm pioneered by Google for distributed data analysis
  • Data-parallel analyses fit well into this model
  • Split data, work on each part in parallel, then merge results
  • Hadoop, Disco, CloudCrowd, …
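
The split / work-in-parallel / merge pattern can be shown with a toy residue count. This is a minimal single-machine sketch, not one of the frameworks named above: the function names and sample sequences are made up, and the thread pool only stands in for separate worker nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, n_parts):
    """Split a list into at most n_parts contiguous chunks of near-equal size."""
    size = max(1, -(-len(records) // n_parts))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: count residues in one chunk, independently of all others."""
    counts = Counter()
    for sequence in chunk:
        counts.update(sequence)
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total += partial
    return total


sequences = ["ACGT", "GGCC", "ATAT", "CCGA", "TTTT"]
chunks = split(sequences, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_chunk, chunks))
totals = reduce_counts(partials)
```

Because each chunk is counted without reference to any other, the map step needs no shared state, which is what makes the model a good fit for loosely coupled nodes.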
Serial Execution of Proteomics Search
Parallel Proteomics Search
Roll-Your-Own MR on AWS
  • Define small scripts to
      - Split a FASTA file
      - Run a BLAT search
  • The first script defines the inputs of the second
  • Submit the input FASTA to S3
  • Start a master node as the central communication hub
  • Start slave nodes, configured to ask for work from the master and save results back to S3
  • Press “Play”
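
A sketch of how the first script can define the inputs of the second: split the FASTA text into chunks, then emit one search job per chunk. The talk's own scripts are not reproduced here; `read_fasta`, `split_fasta`, `search_jobs`, and the chunk naming scheme are assumptions, and no BLAT invocation or S3 call is shown.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) records."""
    records, header, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def split_fasta(text, records_per_chunk):
    """First script: cut the input into smaller FASTA chunks."""
    records = read_fasta(text)
    chunks = []
    for start in range(0, len(records), records_per_chunk):
        group = records[start:start + records_per_chunk]
        chunks.append("".join(f">{h}\n{s}\n" for h, s in group))
    return chunks


def search_jobs(chunks, prefix="chunk"):
    """Each chunk becomes the input of one search job (the second script)."""
    return [{"name": f"{prefix}_{i:03d}.fa", "fasta": c} for i, c in enumerate(chunks)]


fasta = ">r1\nACGT\nAC\n>r2\nGGGG\n>r3\nTTTT\n"
jobs = search_jobs(split_fasta(fasta, 2))
```

Splitting on record boundaries rather than byte offsets keeps every sequence whole, so each job can be searched and saved independently.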
Workflow of Distributed BLAT
  • [Diagram: PC, Master, S3, and four Slave nodes]
  • Boot master & slaves
  • Submit the BLAT job
  • Initial process splits the FASTA file. Subsequent jobs BLAT the smaller files and save each result as they go
  • Upload inputs
  • Download results
Master Node => Resque
  • Background job processing framework developed by GitHub
  • Jobs are attached to a class from your application and stored as JSON
  • Uses the Redis key-value store
  • Simple front end for viewing job queue status and failed jobs
  • http://github.com/defunkt/resque
  • Resque can invoke any class that has a class method “perform()”
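
The job model described above (a class name plus arguments stored as JSON, and a worker that calls the class method `perform`) can be modelled in a few lines. This is a Python stand-in for the pattern, not Resque's Ruby API: the `deque` takes the place of a Redis list, and `BlatSearch`, `enqueue`, and `work_one` are hypothetical names.

```python
import json
from collections import deque


class BlatSearch:
    """Hypothetical job class; the queue only needs it to expose `perform`."""
    results = []

    @classmethod
    def perform(cls, chunk_name):
        # A real job would run the search here; this one only records the call.
        cls.results.append(f"searched {chunk_name}")


JOB_CLASSES = {"BlatSearch": BlatSearch}   # name -> class lookup for the worker
queue = deque()                            # stand-in for a Redis list


def enqueue(job_class, *args):
    """Store the job as JSON: the class name plus its arguments."""
    queue.append(json.dumps({"class": job_class.__name__, "args": list(args)}))


def work_one():
    """Pop one job, look up its class, and call the class method `perform`."""
    if not queue:
        return False
    payload = json.loads(queue.popleft())
    JOB_CLASSES[payload["class"]].perform(*payload["args"])
    return True


enqueue(BlatSearch, "chunk_000.fa")
enqueue(BlatSearch, "chunk_001.fa")
while work_one():
    pass
```

Storing only a class name and plain arguments keeps the queue entries small and language-neutral; the worker needs nothing beyond the application's classes to run them.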
The scripts
Storage in the Cloud: S3
  • Permanent storage for your data
  • Pay as you go for transmission and holding
  • Eliminates backups
  • Pretty good CDN: able to hook into a better CDN SLA via CloudFront
  • Can be slow at times: reports of 10-second delays, but the average response is 300 ms
  • [Diagram: Your Data, S3]
S3 Costs
Storage 2: Distributed FS on EC2
  • Hadoop HDFS, Gigaspaces, etc.
  • Network latency may be an issue for traditional DFSs: Gluster, GPFS, etc.
  • Tighter integration with execution framework, better performance?
  • [Diagram: Your Data, five EC2 Nodes, Disk]
DFS on EC2 m1.xlarge Costs
  • * Does not take into account transmission fees or data redundancy. Final cost is probably >= S3
Storage 3: Memory Grids
  • “RAM is the new Disk”
  • Application-level RAM clustering: Terracotta, Gemstone Gemfire, Oracle, Cisco, Gigaspaces
  • Performance for capability jobs?
  • [Diagram: Your Data, six EC2 RAM nodes]
  • * There are also the “Disk is the new RAM” groups, where redundant disk is used to mitigate seek times on subsequent reads
Memory Grid Cost
  • Take-home message: unless your needs are small, you may be better off procuring bare-metal resources
Cloud Influence on Bioinformatics
  • Computational Biology
      - Algorithms will need to account for large I/O latency
      - Statistical tests will need to account for incomplete information or incremental results
  • Software Engineering
      - Built-for-the-cloud algorithms are popping up
      - CloudBurst is a featured example in AWS EMR!
  • Application to Life Sciences
      - Deploy ready-made images for use
      - Cycle Computing, ViPDAC, others soon to follow
Algorithms need to be I/O centric
  • Incur a slightly higher computational burden to reduce I/O across non-optimal networks
  • P. Balaji, W. Feng, H. Lin 2008
Some Internal Projects
  • Resource Manager
      - Service for on-demand provisioning and release of EC2 nodes
      - Utilizes Chef to define and apply roles (compute node, DB server, etc.)
      - Terminates idle compute nodes at 52 minutes
  • Workflow Manager
      - Defines and executes data analysis workflows
      - Relies on RM to provision nodes
      - Once appropriate worker nodes are available, acts as the central work queue
  • RUM: RNA-Seq Ultimate Mapper
      - Map Reduce RNA-Seq analysis pipeline
      - Combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads
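
One possible reading of the 52-minute rule, sketched below: an idle node is released only late in an hour that has already been paid for. The talk does not explain the rule, so this is an assumption based on EC2 billing by the full instance-hour at the time; `should_terminate` and its parameters are hypothetical and are not the Resource Manager's actual code.

```python
def should_terminate(uptime_minutes, is_idle, cutoff_minute=52, billing_period=60):
    """Release an idle node only near the end of the billing period already paid for.

    ASSUMPTION: "at 52 minutes" means minute 52 of each billed hour.
    """
    minute_in_period = uptime_minutes % billing_period
    return is_idle and minute_in_period >= cutoff_minute


decisions = {
    "idle at 0:30": should_terminate(30, True),
    "idle at 0:52": should_terminate(52, True),
    "busy at 0:55": should_terminate(55, False),
    "idle at 1:52": should_terminate(112, True),
}
```

Under this reading, a node that goes idle early in the hour stays available for new work at no extra cost until the cutoff.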
Bowtie Alone
RUM (Bowtie + BLAT + processing)
  • Significantly increases the confidence of your data
RUM Costs
  • Computational cost ~$100 - $200: 6-8 hours per lane on m2.4xlarge ($2.40 / hour)
  • Cost of reagents ~= $10,000
  • Computation is ~1% of the total
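
The figures on this slide can be checked with a little arithmetic. The hourly rate, hours per lane, and reagent cost come from the slide; the lane count of 8 is an assumption (one eight-lane flow cell) that the talk does not state, and the function names are illustrative.

```python
HOURLY_RATE = 2.40            # m2.4xlarge price per hour, from the slide
HOURS_PER_LANE = (6, 8)       # run time per lane, from the slide
REAGENT_COST = 10_000         # approximate reagent cost, from the slide
LANES = 8                     # ASSUMPTION: one 8-lane flow cell; not stated in the talk


def compute_cost(hours, lanes=1, rate=HOURLY_RATE):
    """Instance cost in dollars for `lanes` lanes at `hours` hours each."""
    return round(hours * lanes * rate, 2)


def compute_share(cost, reagents=REAGENT_COST):
    """Fraction of the total (compute + reagents) spent on computation."""
    return cost / (cost + reagents)


per_lane = [compute_cost(h) for h in HOURS_PER_LANE]         # one lane
per_run = [compute_cost(h, LANES) for h in HOURS_PER_LANE]   # assumed 8 lanes
shares = [compute_share(c) for c in (100, 200)]              # slide's $100-$200 range
```

A single lane works out to roughly $14 to $19 of instance time, so the slide's $100 to $200 is consistent with about eight lanes; at that level computation is between about 1% and 2% of the combined cost.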
Acknowledgements
  • Garret FitzGerald, Ian Blair, John Hogenesch, Greg Grant, Tilo Grosser
  • NIH & UPENN for support
  • My Team: David Austin, Andrew Brader, Weichen Wu
  • Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s

Editor's notes

  1. REFERENCE: P. Balaji, W. Feng, H. Lin. Semantic-based Distributed I/O with the ParaMEDIC Framework. ACM/IEEE International Symposium on High-Performance Distributed Computing, April 2008. http://www.mpiblast.org/About/Publications