Everything Comes in 3’s Angel Pizarro Director, ITMAT Bioinformatics Facility University of Pennsylvania School of Medicine
Outline: This talk looks at the practical aspects of cloud computing. We will be diving into specific examples: 3 pillars of systems design; 3 storage implementations; 3 areas of bioinformatics, and how they are affected by clouds; 3 interesting internal projects. There are 2 hard problems in computer science: caching, naming, and off-by-1 errors.
Pillars of Systems Design. Provisioning: API access (AWS, Microsoft, RackSpace, GoGrid, etc.). Not discussing further, since this is the WHOLE POINT of cloud computing. Configuration: how to get a system up to the point where you can do something with it. Command and Control: how to tell the system what to do.
System Configuration with Chef: Automatic installation of packages, service configuration, and initialization. Specifications use a real programming language with known behavior. Bring the system to an idempotent state. http://opscode.com/chef/ http://hotclub.files.wordpress.com/2009/09/swedish_chef_bork-sleeper-cell.jpg
Chef Recipes & Cookbooks: The specification for installing and configuring a system component. Able to support more than one platform. Has access to system-wide information (hostname, IP address, RAM, number of processors, etc.). Cookbooks contain templates, documentation, static files & assets. Can define dependencies on other recipes. Executed in order; execution stops at the first failure.
Simple Recipe: Rsync. Installs rsync on the system. The metadata file states which platforms are supported. Note that Chef is a Linux-centric system. BUT, the WikiWiki is MessyMessy. Look at Chef Solo & Resources.
More Complex Recipe: Heartbeat. Installs the heartbeat package. Registers the service and specifies that it can be restarted and that it provides a status message. Finally, it starts the service.
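The recipe behavior described on these slides (resources applied in order, a stop at the first failure, and repeated runs that leave the system unchanged) can be sketched in a few lines. This is not Chef and does not use its Ruby DSL; it is a minimal Python illustration, and the names `ensure_file`, `converge`, and the `ha.cf` file are hypothetical.

```python
import os
import tempfile


def ensure_file(path, content):
    """Idempotent resource: write `content` to `path` only if it differs.

    Returns True when a change was made, False when the file was already correct.
    """
    if os.path.exists(path):
        with open(path) as handle:
            if handle.read() == content:
                return False
    with open(path, "w") as handle:
        handle.write(content)
    return True


def converge(resources):
    """Apply resources in order; an exception from any resource stops the run."""
    return [resource() for resource in resources]


def demo():
    """Run the same hypothetical 'recipe' twice against a scratch directory."""
    with tempfile.TemporaryDirectory() as workdir:
        conf = os.path.join(workdir, "ha.cf")  # hypothetical config file name
        recipe = [lambda: ensure_file(conf, "keepalive 2\n")]
        first = converge(recipe)
        second = converge(recipe)
        return first, second
```

The first run reports a change and the second reports none, which is the property that makes it safe to re-run a configuration on a node that is already set up.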
Command and Control. Traditional grid computing: QSUB (SGE, PBS, Torque). Usually requires tightly coupled and static systems: shared file systems, firewalls, user accounts, shared exe & lib locations. Best for capability processes (e.g. MPI). Map-Reduce is the new hotness: best for data-parallel processes; assumes loosely coupled, non-static components; job staging is a critical component.
Map Reduce in a Nutshell: Algorithm pioneered by Google for distributed data analysis. Data-parallel analyses fit well into this model. Split data, work on each part in parallel, then merge results. Hadoop, Disco, CloudCrowd, …
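The split / work in parallel / merge pattern can be shown on a single machine. The sketch below is an illustration only: threads stand in for worker nodes, base counting stands in for a real analysis, and none of the function names come from Hadoop, Disco, or CloudCrowd.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, parts):
    """Deal records into `parts` roughly equal chunks."""
    return [records[i::parts] for i in range(parts)]


def count_bases(chunk):
    """Map step: count characters in one chunk of sequences."""
    counts = Counter()
    for sequence in chunk:
        counts.update(sequence)
    return counts


def merge(partials):
    """Reduce step: add the per-chunk counts together."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def map_reduce(records, parts=4):
    """Split the input, process each chunk in a worker, then merge the results."""
    chunks = split(records, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(count_bases, chunks))
    return merge(partials)
```

Because each chunk is processed independently, the map step needs no shared state, which is what lets the same shape scale out to loosely coupled nodes.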
Serial Execution of Proteomics Search
Parallel Proteomics Search
Roll-Your-Own MR on AWS: Define small scripts to split a FASTA file and to run a BLAT search. The first script defines the inputs of the second. Submit the input FASTA to S3. Start a master node as the central communication hub. Start slave nodes, configured to ask for work from the master and save results back to S3. Press “Play”.
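The splitting step can be sketched as follows. The talk's own scripts are not reproduced in this transcript, so this is an assumed Python equivalent: `read_fasta` and `split_fasta` are hypothetical names, and the chunk size is an arbitrary parameter.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(lines)))
            header, lines = line[1:], []
        elif header is not None:
            lines.append(line)
    if header is not None:
        records.append((header, "".join(lines)))
    return records


def split_fasta(text, records_per_chunk):
    """Split FASTA text into smaller FASTA strings, one per future search job."""
    records = read_fasta(text)
    chunks = []
    for start in range(0, len(records), records_per_chunk):
        group = records[start:start + records_per_chunk]
        chunks.append("".join(f">{name}\n{seq}\n" for name, seq in group))
    return chunks
```

Each returned chunk is a complete FASTA document, so it can be handed to a search job as-is.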
Workflow of Distributed BLAT. [Diagram labels: PC, Master, Slaves, S3; Boot master & slaves; Upload inputs; Submit the BLAT job; Download results.] Initial process splits the FASTA file. Subsequent jobs BLAT the smaller files and save each result as they go.
Master Node => Resque: GitHub-developed background job processing framework. Jobs are attached to a class from your application and stored as JSON. Uses the Redis key-value store. Simple front end for viewing job queue status and failed jobs. http://github.com/defunkt/resque Resque can invoke any class that has a class method “perform()”.
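The job model described here (a class name plus arguments serialized as JSON, later dispatched to that class's `perform` method) can be imitated in a few lines. Resque itself is a Ruby library backed by Redis; the Python sketch below only mirrors the idea, with an in-memory deque standing in for Redis and a hypothetical `BlatSearch` job class.

```python
import json
from collections import deque


class BlatSearch:
    """Hypothetical job class; only the class-level `perform` matters."""

    @classmethod
    def perform(cls, chunk_name):
        # A real worker would run the search here; this just reports the input.
        return f"searched {chunk_name}"


JOB_CLASSES = {"BlatSearch": BlatSearch}  # registry of known job classes
queue = deque()  # in-memory stand-in for a Redis list


def enqueue(job_class, *args):
    """Store the job as a JSON string naming the class and its arguments."""
    queue.append(json.dumps({"class": job_class.__name__, "args": list(args)}))


def work_one():
    """Pop the oldest job, look up its class, and call perform with its args."""
    payload = json.loads(queue.popleft())
    return JOB_CLASSES[payload["class"]].perform(*payload["args"])
```

Since a queued job is only a small JSON string, any worker that can load the named class can pick it up, which suits slave nodes that come and go.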
The scripts
Storage in the Cloud: S3. Permanent storage for your data. Pay as you go for transmission and holding. Eliminates backups. Pretty good CDN; able to hook into better CDN SLA via CloudFront. Can be slow at times: reports of 10-second delays, but the average response is 300 ms. [Diagram: Your Data, S3]
S3 Costs
Storage 2: Distributed FS on EC2. Hadoop HDFS, Gigaspaces, etc. Network latency may be an issue for traditional DFSs (Gluster, GPFS, etc.). Tighter integration with execution framework, better performance? [Diagram: Your Data, five EC2 Nodes, Disk]
DFS on EC2 m1.xlarge Costs. * Does not take into account transmission fees or data redundancy. Final cost is probably >= S3.
Storage 3: Memory Grids. “RAM is the new Disk”. Application-level RAM clustering: Terracotta, Gemstone Gemfire, Oracle, Cisco, Gigaspaces. Performance for capability jobs? [Diagram: Your Data, six EC2 RAM nodes] * There are also the “Disk is the new RAM” groups, where redundant disk is used to mitigate seek times on subsequent reads.
Memory Grid Cost. Take-home message: unless your needs are small, you may be better off procuring bare-metal resources.
Cloud Influence on Bioinformatics. Computational Biology: algorithms will need to account for large I/O latency; statistical tests will need to account for incomplete information or incremental results. Software Engineering: built-for-the-cloud algorithms are popping up; CloudBurst is a featured example in AWS EMR! Application to Life Sciences: deploy ready-made images for use; Cycle Computing, ViPDAC, others soon to follow.
Algorithms need to be I/O centric Incur a slightly higher computational burden to reduce I/O across non-optimal networks P. Balaji, W. Feng, H. Lin 2008
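One everyday instance of trading computation for I/O is compressing data before it crosses the network. The sketch below is only an illustration of that trade-off and is not the method of the cited authors; `pack` and `unpack` are hypothetical helper names.

```python
import zlib


def pack(sequence_text):
    """Spend CPU time compressing so that fewer bytes cross the network."""
    return zlib.compress(sequence_text.encode("ascii"), 9)


def unpack(payload):
    """Reverse of pack: decompress on the receiving node."""
    return zlib.decompress(payload).decode("ascii")
```

Sequence data tends to be repetitive, so the extra CPU time on both ends can be repaid by a smaller transfer over a slow link.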
Some Internal Projects. Resource Manager: service for on-demand provisioning and release of EC2 nodes; utilizes Chef to define and apply roles (compute node, DB server, etc.); terminates idle compute nodes at 52 minutes. Workflow Manager: defines and executes data analysis workflows; relies on RM to provision nodes; once appropriate worker nodes are available, acts as the central work queue. RUM (RNA-Seq Ultimate Mapper): Map Reduce RNA-Seq analysis pipeline; combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads.
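The 52-minute rule for idle nodes can be expressed as a small predicate. The slide does not give the rationale or the implementation, so this sketch assumes an hourly billing period and that the cutoff applies within each hour; `should_terminate` and its parameters are hypothetical.

```python
def should_terminate(minutes_since_launch, is_idle, cutoff_minute=52, billing_period=60):
    """Return True when an idle node is late in its current billing period."""
    if not is_idle:
        return False
    # Position within the current (assumed hourly) billing period.
    minute_in_period = minutes_since_launch % billing_period
    return minute_in_period >= cutoff_minute
```

Under that assumption, a node that has already been paid for stays available for new work until shortly before the next charge would apply.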
Bowtie Alone
RUM (Bowtie + BLAT + processing) Significantly increases the confidence of your data
RUM Costs. Computational cost ~$100 - $200. 6-8 hours per lane on m2.4xlarge ($2.40 / hour). Cost of reagents ~= $10,000. 1% of total.
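The figures on this slide can be checked with a little arithmetic. The slide does not say how many lanes or nodes the ~$100 - $200 figure covers, so the sketch below only computes the instance charge for a single lane and the compute cost as a share of the reagent cost; the function names are hypothetical.

```python
RATE_CENTS_PER_HOUR = 240  # $2.40 per hour for m2.4xlarge, from the slide


def lane_cost_dollars(hours):
    """Instance cost for one lane at the hourly rate above."""
    return hours * RATE_CENTS_PER_HOUR / 100


def share_of_reagents(compute_dollars, reagent_dollars=10_000):
    """Compute cost as a percentage of the reagent cost."""
    return 100 * compute_dollars / reagent_dollars
```

For example, `lane_cost_dollars(6)` and `lane_cost_dollars(8)` give the single-instance charge for the 6-8 hour range, and `share_of_reagents(100)` gives the percentage for the low end of the stated compute cost.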
Acknowledgements: Garret FitzGerald, Ian Blair, John Hogenesch, Greg Grant, Tilo Grosser. NIH & UPENN for support. My Team: David Austin, Andrew Brader, Weichen Wu. Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s

Everything comes in 3's

  • 1. Everything Comes in 3’s. Angel Pizarro, Director, ITMAT Bioinformatics Facility, University of Pennsylvania School of Medicine
  • 2. Outline. This talk looks at the practical aspects of cloud computing, diving into specific examples: 3 pillars of systems design; 3 storage implementations; 3 areas of bioinformatics, and how they are affected by clouds; 3 interesting internal projects. There are 2 hard problems in computer science: caching, naming, and off-by-1 errors.
  • 3. Pillars of Systems Design. Provisioning: API access (AWS, Microsoft, RackSpace, GoGrid, etc.); not discussed further, since this is the WHOLE POINT of cloud computing. Configuration: how to get a system up to the point where you can do something with it. Command and Control: how to tell the system what to do.
  • 4. System Configuration with Chef. Automatic installation of packages, service configuration, and initialization. Specifications use a real programming language with known behavior. Brings the system to the specified state idempotently: applying a specification again changes nothing once that state is reached. http://opscode.com/chef/ http://hotclub.files.wordpress.com/2009/09/swedish_chef_bork-sleeper-cell.jpg
  • 5. Chef Recipes & Cookbooks. The specification for installing and configuring a system component. Able to support more than one platform. Has access to system-wide information: hostname, IP address, RAM, number of processors, etc. Contain templates, documentation, static files & assets. Can define dependencies on other recipes. Executed in order; execution stops at the first failure.
  • 6. Simple Recipe: Rsync. Installs rsync on the system. The metadata file states which platforms are supported. Note that Chef is a Linux-centric system. BUT the WikiWiki is MessyMessy. Look at Chef Solo & Resources.
  • 7. More Complex Recipe: Heartbeat. Installs the heartbeat package. Registers the service, specifying that it can be restarted and that it provides a status message. Finally, it starts the service.
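
The Chef slides describe declaring what a system should look like and converging to that state idempotently. A minimal Python sketch of the idea follows. It is not Chef's API (Chef recipes are written in a Ruby DSL); `converge`, the in-memory `system` dictionary, and the resource tuples are hypothetical stand-ins for real package and service operations.

```python
def converge(system, resources):
    """Bring `system` to the declared state; return the actions actually taken."""
    actions = []
    for kind, name in resources:  # resources are applied in the order declared
        if kind == "package":
            if name not in system["packages"]:
                system["packages"].add(name)
                actions.append(f"install {name}")
        elif kind == "service":
            if name not in system["packages"]:
                # Stop at the first failure instead of continuing the run.
                raise RuntimeError(f"cannot start {name}: package not installed")
            if name not in system["running"]:
                system["running"].add(name)
                actions.append(f"start {name}")
        else:
            raise ValueError(f"unknown resource kind: {kind}")
    return actions


# Hypothetical in-memory "system" and a recipe mirroring the two slides above.
system = {"packages": set(), "running": set()}
recipe = [("package", "rsync"), ("package", "heartbeat"), ("service", "heartbeat")]

first_run = converge(system, recipe)
second_run = converge(system, recipe)
print(first_run)
print(second_run)
```

The first run performs three actions; the second run returns an empty list because the system already matches the declaration, which is the property the slide calls idempotent.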
  • 8. Command and Control. Traditional grid computing: QSUB (SGE, PBS, Torque). Usually requires tightly coupled and static systems: shared file systems, firewalls, user accounts, shared exe & lib locations. Best for capability processes (e.g. MPI). Map-Reduce is the new hotness: best for data-parallel processes; assumes loosely coupled, non-static components; job staging is a critical component.
  • 9. Map Reduce in a Nutshell. Algorithm pioneered by Google for distributed data analysis. Data-parallel analyses fit well into this model: split the data, work on each part in parallel, then merge the results. Hadoop, Disco, CloudCrowd, …
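
The split, work-in-parallel, merge pattern can be sketched in a few lines of Python. This is only an illustration of the pattern, not how Hadoop, Disco, or CloudCrowd are implemented: a thread pool stands in for separate machines, and the function names and the base-counting task are assumptions chosen for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, n_parts):
    """Deal the records into n_parts roughly equal chunks."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_chunk(chunk):
    """Work on one chunk independently: count the bases in its sequences."""
    counts = Counter()
    for seq in chunk:
        counts.update(seq)
    return counts


def reduce_counts(partials):
    """Merge the per-chunk results into one total."""
    total = Counter()
    for partial in partials:
        total += partial
    return total


def map_reduce(records, n_parts=3):
    chunks = split(records, n_parts)
    # The thread pool is a local stand-in for a set of worker nodes.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


reads = ["ACGT", "AACC", "GGTT", "ACAC"]
totals = map_reduce(reads)
print(dict(totals))
```

Because each chunk is processed without reference to the others, the number of parts changes only how the work is distributed, not the merged result.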
  • 10. Serial Execution of Proteomics Search
  • 12. Roll-Your-Own MR on AWS. Define small scripts to (1) split a FASTA file and (2) run a BLAT search; the first script defines the inputs of the second. Submit the input FASTA to S3. Start a master node as the central communication hub. Start slave nodes, configured to ask the master for work and save results back to S3. Press “Play”.
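
The first of the two scripts, splitting a FASTA file into smaller inputs, might look like the following Python sketch. The talk does not show its actual scripts, so `read_fasta`, `split_fasta`, and the chunk size are hypothetical; the S3 upload and the BLAT call are left out.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        else:
            # Sequence lines seen before any header are discarded at the next header.
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(lines, records_per_chunk):
    """Group records into chunks; each chunk is FASTA text for one search job."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunks, current = [], []
    for header, seq in read_fasta(lines):
        current.append(f"{header}\n{seq}\n")
        if len(current) == records_per_chunk:
            chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))
    return chunks


fasta_text = ">seq1\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"
chunks = split_fasta(fasta_text.splitlines(), records_per_chunk=2)
print(len(chunks))
```

Each element of `chunks` would become the input of one search job, which is how the first script defines the inputs of the second.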
  • 13. Workflow of Distributed BLAT. [Diagram: PC, Master, S3, and four Slave nodes. Boot master & slaves; submit the BLAT job; upload inputs; download results. The initial process splits the FASTA file; subsequent jobs BLAT the smaller files and save each result as they go.]
  • 14. Master Node => Resque. Background job processing framework developed by GitHub. Jobs are attached to a class from your application and stored as JSON. Uses the Redis key-value store. Simple front end for viewing job queue status and failed jobs. http://github.com/defunkt/resque Resque can invoke any class that has a class method “perform()”.
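
The job model on this slide (a class name plus arguments stored as JSON, executed by calling a class-level `perform`) can be sketched in Python. Resque itself is a Ruby library backed by Redis; here an in-memory deque stands in for Redis, and `register`, `enqueue`, `work_one`, and the `BlatChunk` job are hypothetical names, not Resque's API.

```python
import json
from collections import deque

JOB_CLASSES = {}   # name -> class, so a JSON payload can be mapped back to code
queue = deque()    # in-memory stand-in for the Redis-backed queue


def register(cls):
    """Class decorator that makes a job class discoverable by name."""
    JOB_CLASSES[cls.__name__] = cls
    return cls


@register
class BlatChunk:
    """Hypothetical job: a real one would run BLAT on the named chunk."""

    @classmethod
    def perform(cls, chunk_name):
        return f"searched {chunk_name}"


def enqueue(job_class, *args):
    """Store the job as JSON: only the class name and its arguments."""
    queue.append(json.dumps({"class": job_class.__name__, "args": list(args)}))


def work_one():
    """Pop one job, look up its class, and call perform(); None if the queue is empty."""
    if not queue:
        return None
    payload = json.loads(queue.popleft())
    return JOB_CLASSES[payload["class"]].perform(*payload["args"])


enqueue(BlatChunk, "chunk_001.fa")
result = work_one()
print(result)
```

Storing only a class name and plain arguments keeps the queue language-neutral and easy to inspect, which is what makes a simple status front end possible.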
  • 16. Storage in the Cloud: S3. Permanent storage for your data. Pay as you go for transmission and holding. Eliminates backups. Pretty good CDN; able to hook into a better CDN SLA via CloudFront. Can be slow at times: reports of 10-second delays, but the average response is 300 ms. [Diagram: Your Data, S3]
  • 18. Storage 2: Distributed FS on EC2. Hadoop HDFS, Gigaspaces, etc. Network latency may be an issue for traditional DFSs (Gluster, GPFS, etc.). Tighter integration with the execution framework, better performance? [Diagram: Your Data, five EC2 nodes, Disk]
  • 19. DFS on EC2 m1.xlarge Costs. * Does not take into account transmission fees or data redundancy. Final cost is probably >= S3.
  • 20. Storage 3: Memory Grids. “RAM is the new Disk”. Application-level RAM clustering: Terracotta, Gemstone Gemfire, Oracle, Cisco, Gigaspaces. Performance for capability jobs? [Diagram: Your Data, six EC2 RAM nodes] * There are also the “Disk is the new RAM” groups, where redundant disk is used to mitigate seek times on subsequent reads.
  • 21. Memory Grid Cost. Take-home message: unless your needs are small, you may be better off procuring bare-metal resources.
  • 22. Cloud Influence on Bioinformatics. Computational Biology: algorithms will need to account for large I/O latency; statistical tests will need to account for incomplete information or incremental results. Software Engineering: built-for-the-cloud algorithms are popping up; CloudBurst is a featured example in AWS EMR! Application to Life Sciences: deploy ready-made images for use; Cycle Computing, ViPDAC, others soon to follow.
  • 23. Algorithms Need to Be I/O Centric. Incur a slightly higher computational burden to reduce I/O across non-optimal networks (P. Balaji, W. Feng, H. Lin, 2008).
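
The trade-off on this slide can be made concrete with a toy cost model: total time is compute time plus bytes moved divided by bandwidth. The Python sketch below uses made-up numbers (a 10 MB/s link, 30% extra compute to shrink the transfer from 100 GB to 1 GB) purely for illustration; none of them come from the talk or the cited paper.

```python
def total_seconds(compute_s, transfer_bytes, bandwidth_bytes_per_s):
    """Simple model: compute and transfer happen one after the other."""
    return compute_s + transfer_bytes / bandwidth_bytes_per_s


BANDWIDTH = 10 * 10**6  # hypothetical slow wide-area link: 10 MB/s

# Plan A: less compute, but ship the full 100 GB of results.
ship_everything = total_seconds(1000, 100 * 10**9, BANDWIDTH)

# Plan B: 30% more compute in exchange for moving only 1 GB.
ship_compact = total_seconds(1300, 1 * 10**9, BANDWIDTH)

print(ship_everything, ship_compact)
```

With these assumed figures the compute-heavier plan finishes in 1,400 seconds against 11,000 seconds, because the transfer term dominates on a slow network.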
  • 24. Some Internal Projects. Resource Manager: service for on-demand provisioning and release of EC2 nodes; utilizes Chef to define and apply roles (compute node, DB server, etc.); terminates idle compute nodes at 52 minutes. Workflow Manager: defines and executes data analysis workflows; relies on the RM to provision nodes; once appropriate worker nodes are available, acts as the central work queue. RUM (RNA-Seq Ultimate Mapper): Map Reduce RNA-Seq analysis pipeline; combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads.
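
The Resource Manager's idle-node rule could be expressed as a small predicate. The slide does not say what the 52 minutes is measured from; the Python sketch below assumes it means 52 minutes into each hour of uptime, so that an idle node is released shortly before another hour begins. That reading, and the function names, are assumptions.

```python
def minutes_into_current_hour(uptime_minutes):
    """Position within the current hour of uptime (0-59)."""
    return uptime_minutes % 60


def should_terminate(uptime_minutes, is_idle, cutoff_minute=52):
    """Release a node only if it is idle and past the cutoff within the hour."""
    return is_idle and minutes_into_current_hour(uptime_minutes) >= cutoff_minute


print(should_terminate(52, is_idle=True))
print(should_terminate(65, is_idle=True))
```

Under this reading, a node that goes idle early in an hour stays available for new work until minute 52, and a busy node is never terminated.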
  • 26. RUM (Bowtie + BLAT + processing). Significantly increases confidence in your data.
  • 27. RUM Costs. Computational cost: ~$100 - $200; 6-8 hours per lane on m2.4xlarge ($2.40 / hour). Cost of reagents: ~$10,000. Compute is roughly 1% of the total.
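
The figures on this slide can be checked with a few lines of arithmetic. The Python sketch below uses only the numbers quoted above; the variable names are illustrative, and `Decimal` is used to keep the dollar amounts exact.

```python
from decimal import Decimal

HOURLY_RATE = Decimal("2.40")               # m2.4xlarge, USD per hour
HOURS_PER_LANE = (Decimal(6), Decimal(8))   # low and high estimates
COMPUTE_COST = (Decimal(100), Decimal(200)) # quoted computational cost, USD
REAGENT_COST = Decimal(10000)               # quoted reagent cost, USD

# Instance time for a single lane at the quoted rate.
per_lane = tuple(hours * HOURLY_RATE for hours in HOURS_PER_LANE)

# Quoted computational cost as a percentage of reagent cost.
share_percent = tuple(cost / REAGENT_COST * 100 for cost in COMPUTE_COST)

print(per_lane)
print(share_percent)
```

Instance time for one lane comes to $14.40 to $19.20, and the quoted $100 to $200 is 1% to 2% of the reagent cost. The slide does not state how many lanes the $100 to $200 figure covers.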
  • 28. Acknowledgements. Garret FitzGerald, Ian Blair, John Hogenesch, Greg Grant, Tilo Grosser. NIH & UPENN for support. My Team: David Austin, Andrew Brader, Weichen Wu. Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s

Editor's notes

  1. REFERENCE: P. Balaji, W. Feng, H. Lin. Semantic-based Distributed I/O with the ParaMEDIC Framework. ACM/IEEE International Symposium on High-Performance Distributed Computing, April 2008. http://www.mpiblast.org/About/Publications