Cloud-Based Services for Large Scale Analysis of Sequence & Expression Data: Lessons from Cistrack
October 6, 2009
Robert Grossman
Laboratory for Advanced Computing, University of Illinois at Chicago
Open Data Group
Institute for Genomics & Systems Biology, University of Chicago
Cistrack Team (UIC & U. Chicago)
Jia Chen
Robert Grossman
Yunhong Gu
David Hanley
Xiangjun Liu
Michal Sabala
Damian Roqueiro
Parantu K Shah
Feng Tian
Kevin White
Projected sequencing capabilities (world-wide). [Figure: log10 billions of base pairs over time, with projected milestones: 2019, one of each described species; 2023, one of each species (~10M estimate); 2031, one of each species (~100M estimate); 2060, total human population.] Source: Kevin White, unpublished.
Is Biology a Large Data Science? CPUs double approximately every 18 months (Moore’s Law). Disks double every 12-15 months (Johnson’s Law). The amount of publicly available sequence data is doubling approximately every 12 months.
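To make the comparison of doubling times concrete, here is a minimal sketch that converts each doubling period into a growth factor over a fixed horizon. The six-year horizon and the function name `growth_factor` are illustrative assumptions, not figures from the talk.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years` for a quantity that doubles
    every `doubling_months` months."""
    return 2.0 ** (years * 12.0 / doubling_months)


YEARS = 6  # illustrative horizon (assumption, not from the slides)

cpu = growth_factor(YEARS, 18)        # CPUs: doubling every 18 months
disk_slow = growth_factor(YEARS, 15)  # disks: slow end of the 12-15 month range
disk_fast = growth_factor(YEARS, 12)  # disks: fast end of the 12-15 month range
sequence = growth_factor(YEARS, 12)   # sequence data: doubling every 12 months

print(f"CPU growth over {YEARS} years:      {cpu:.0f}x")
print(f"Disk growth over {YEARS} years:     {disk_slow:.1f}x to {disk_fast:.0f}x")
print(f"Sequence growth over {YEARS} years: {sequence:.0f}x")
print(f"Sequence data outgrows CPUs by:     {sequence / cpu:.0f}x")
```

Under these assumptions, sequence data grows about as fast as the best-case disk capacity and several times faster than CPU capability over the same period, which is the gap the slide points at.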
IBM joins race for $100 personal genome.
We Have a Problem: More and more of your colleagues (e.g. the biologist down the hall) with access to modern instruments are producing so much data that they cannot easily manage, analyze and archive it. Large projects build their own infrastructure. Almost all other biologists are on their own.
Point of View: To do research today… data; analytic algorithms & statistical models; analytic infrastructure.
Part 2: What is a Cloud?
What is a Cloud? Software as a Service
Is Anything Else a Cloud? Infrastructure as a Service, based upon scaling virtual machines (VMs)
Are There Other Types of Clouds? Large Data Cloud Services (e.g. ad targeting)
What is Virtualization?
Idea Dates Back to the 1960s. [Figure: native (full) virtualization, with apps running on guest systems (CMS, CMS, MVS) on top of IBM VM/370 on an IBM mainframe.] Example: VMware ESX. Virtualization was first widely deployed with IBM VM/370.
One Definition: Clouds provide on-demand resources or services over a network, often the Internet, with the scale and reliability of a data center. There is no standard definition. Cloud architectures are not new. What is new: scale, ease of use, and the pricing model.
Scale is new.
Elastic, Usage-Based Pricing Is New: 1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
Clouds can be used to manage surges in computing.
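The pricing claim above is simple arithmetic on machine-hours. The sketch below spells it out under two assumptions that are not in the slides: a hypothetical flat rate per machine-hour (`RATE`), and a job that parallelizes perfectly across machines.

```python
def usage_cost(machines: int, hours: int, rate_per_machine_hour: float) -> float:
    """Cost under pure usage-based pricing: pay only for machine-hours consumed."""
    return machines * hours * rate_per_machine_hour


RATE = 0.10  # hypothetical price per machine-hour (assumption)

one_machine = usage_cost(machines=1, hours=120, rate_per_machine_hour=RATE)
many_machines = usage_cost(machines=120, hours=1, rate_per_machine_hour=RATE)

print(f"1 machine for 120 hours:   ${one_machine:.2f}")
print(f"120 machines for 1 hour:   ${many_machines:.2f}")
print(f"Same cost, result {120 // 1}x sooner (if the job parallelizes perfectly)")
```

The cost is identical because both runs consume 120 machine-hours; what changes is the wall-clock time, which is why elastic pricing suits surges in computing.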
[Figure: timeline of experimental science, simulation science, and data science, with gains of 30x (1609), 250x (1670), 10x-100x (1976), and 10x-100x (2004).]
Clouds vs Grids
Hadoop & Sector
MalStone B Benchmark
Part 3: Cistrack (www.cistrack.org)
Cistrack: A resource for cis-regulatory data. It is open source and based upon CUBioS. It is currently used by the White Lab at the University of Chicago for managing modENCODE fly data. It contains raw, intermediate, and analyzed data from approximately 240 experiments on Agilent, Affymetrix (Affy) and Solexa platforms.
Chromatin Developmental Time-Course
  H3K4me1     enhancers
  H3K4me3     promoters & enhancers
  H3K9Ac      activation
  H3K9me3     heterochromatin
  H3K27Ac     activation
  H3K27me3    repression
  PolII       transcript. & promoters
  CBP         HAT, enhancers
  Total RNA   expression
x 12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre)
8 antibodies used for ChIP experiments (Zirong Li and Bing Ren)
1. Cistrack Supports Cubes of Data: Drosophila regulatory elements from Drosophila modENCODE. ChIP-chip data using Agilent 244K dual-color arrays. Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP. Each factor has been studied at 12 different time points of Drosophila development.
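One way to picture the "cube" described above is as a table indexed by factor and time point, where each cell would hold the per-probe measurements for one experiment. The sketch below is only an illustration of that shape: the integer time-point labels, the empty placeholder cells, and the helper `slice_by_factor` are assumptions, and it does not reflect how Cistrack actually stores its data.

```python
from itertools import product

# The eight factors named on the slide: six histone modifications, PolII and CBP.
FACTORS = [
    "H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1",
    "H3K27Ac", "H3K9Ac", "PolII", "CBP",
]

# Twelve developmental time points; the integer labels are placeholders.
TIME_POINTS = list(range(1, 13))

# Each (factor, time point) cell would hold per-probe measurements
# (the third axis of the cube); here the cells start empty.
cube = {(factor, t): [] for factor, t in product(FACTORS, TIME_POINTS)}


def slice_by_factor(cube: dict, factor: str) -> dict:
    """Return the time course for one factor: {time_point: measurements}."""
    return {t: values for (f, t), values in cube.items() if f == factor}


print(f"{len(FACTORS)} factors x {len(TIME_POINTS)} time points = {len(cube)} cells")
print(f"PolII time course has {len(slice_by_factor(cube, 'PolII'))} entries")
```

Slicing along one axis (a single factor over time, or all factors at one time point) is the kind of query a cube layout makes cheap.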
2. ChIP-Seq Data Volumes are Large: Cistrack integrates with large data clouds.
3. Continuous Reanalysis is Desirable: In general, it is quite labor intensive to reanalyze your existing data with a new algorithm. Cistrack supports VMs that can simplify re-applying Cistrack pipelines that have been updated to include a new algorithm.
Cistrack Architecture
  Cistrack Web Portal & Widgets
  Cistrack Database
  Analysis Pipelines & Re-analysis Services
  Cistrack Cloud Services
  Ingestion Services
Part 4: Reanalysis. Can you repeat an analytic pipeline one year after a post-doc leaves your lab?
Promoters: Use of H3K4me3, PolII & RNA to Map Active Genes
Active Genes  - Solexa Result

Basic Idea: At any time, you can launch one or more virtual machines (VMs) to redo a pipeline analysis, persist the results to a database, and persist the VM to a cloud.
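The reanalysis idea above can be sketched as a registry of pipeline versions whose results are stored side by side in a database, so that an updated algorithm can be re-applied to existing data without losing earlier results. Everything here is a hypothetical stand-in: the two toy "pipelines" (mean and median), the table layout, and the experiment names are assumptions, and launching or persisting VMs is not modeled.

```python
import sqlite3
import statistics

# Hypothetical pipeline versions; real pipelines would be far more involved.
PIPELINES = {
    "v1": statistics.mean,    # original algorithm (placeholder)
    "v2": statistics.median,  # updated algorithm (placeholder)
}


def reanalyze(conn: sqlite3.Connection, experiments: dict, version: str) -> None:
    """Re-apply one pipeline version to every experiment and persist the results."""
    run = PIPELINES[version]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "experiment TEXT, pipeline_version TEXT, value REAL, "
        "PRIMARY KEY (experiment, pipeline_version))"
    )
    for name, values in experiments.items():
        # Keying on (experiment, version) keeps older results available.
        conn.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
            (name, version, run(values)),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database for illustration
experiments = {"exp_a": [1.0, 2.0, 6.0], "exp_b": [4.0, 4.0, 10.0]}

reanalyze(conn, experiments, "v1")
reanalyze(conn, experiments, "v2")  # rerun after the pipeline is updated

rows = conn.execute(
    "SELECT experiment, pipeline_version, value FROM results "
    "ORDER BY experiment, pipeline_version"
).fetchall()
for row in rows:
    print(row)
```

Recording the pipeline version with each result is the design choice that matters here: it is what lets someone repeat or compare an analysis long after the person who ran it has left.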
Raywulf: We have designed a cluster (called a Raywulf Cloud) that is optimized to serve as your own private cloud. It costs about $2K/TB and will be used by the Open Science Data Cloud.
Cis-Regulatory Map of the Drosophila Genome (modENCODE)
Data generation
  Kevin White, U. Chicago (antibody pipeline, ChIP-chip pipeline)
  Bing Ren, UCSD (antibody validation, ChIP-chip pipeline)
  Robert Grossman, U. Illinois (LIMS, data management & analysis)
Computational identification of cis-regulatory motifs
  Manolis Kellis, MIT (motif analysis, ChIP-chip data analysis)
Biological validation
  Jim Posakony, UCSD (promoters/enhancers)
  Steve Russell, Cambridge U. (insulators/silencers)
  Hugo Bellen, Baylor (element “necessity” validations)