The Cutting Edge Can Hurt You
Stories from real-world adopters of next generation
sequencing technology
Christopher Dwan
The Bioteam
Bioteam
• Consultancy, with a software business
• Vendor Neutral, Technology Agnostic
• “Bridge the gap” between high performance computing and life
sciences
• Founded 2003
Shameless Plugs:
• We’re in Booth 113
• Next Generation Sequencing Workshop (yesterday, plus next year)
• http://bioteam.net
• We’re hiring.
cdwan@bioteam.net
Disclaimer
• Most BioTeam clients don’t have seven-figure IT
budgets, petabyte SANs, dedicated datacenters,
and so on.
• Many of these problems:
– Are quite different for the largest Bio-HPC centers
– Simply don’t matter to the nationally funded
projects.
I offer no answers…
Review: 2007 Predictions
• Multi-core commodity processors
– Workstations are already insanely powerful.
• Virtualization on the workstation
– Why port code, when you can make a new machine?
• Reconfigurable computing goes mainstream
– Partnerships, Collaborations, and re-seller agreements.
• Next generation DNA sequencing
– Data tsunami
2007 Predictions - Reviewed
• Multi-core commodity processors
– I touched a 16 core workstation with 10TB of disk. It “just worked.”
• Virtualization on the workstation: Everywhere! Including data!
– I saw a workstation replace 6 legacy OSes in one shop.
• Reconfigurable computing goes mainstream: Seemingly not yet
– Talk to the folks on the show floor to get strong contradictions.
• Next generation DNA sequencing. Yup.
– Data tsunami
NEXT GENERATION SEQUENCERS
$5M: If you get one of these …
You probably know about these
Next Generation Sequencing
• Costs
– Instrument cost: $5×10^5 to $10^6
– Reagent Cost: $3k - $10k
– ~ 1 TB / machine / day
– 4 or 5 vendors
• Cost imbalance with IT
components
– $7k for an experiment
– $3k for a new server
• Opens high-throughput to much
smaller labs.
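To put the "~ 1 TB / machine / day" figure in perspective, here is a minimal Python sketch of how fast a disk pool fills. The function name, the instrument counts, and the 100 TB pool size are illustrative assumptions, not numbers from any particular lab.

```python
def days_until_full(capacity_tb: float, machines: int,
                    tb_per_machine_day: float = 1.0) -> float:
    """Days of continuous acquisition before a pool of capacity_tb is full."""
    if machines <= 0 or tb_per_machine_day <= 0:
        raise ValueError("machines and tb_per_machine_day must be positive")
    return capacity_tb / (machines * tb_per_machine_day)


if __name__ == "__main__":
    # Hypothetical lab sizes: one, two, or four instruments feeding 100 TB.
    for machines in (1, 2, 4):
        days = days_until_full(100, machines)
        print(f"{machines} machine(s): about {days:.0f} days to fill 100 TB")
```

The sketch ignores deletion, compression, and downtime, so treat the result as an upper bound on how long the disk lasts.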
Next Generation Sequencing
• Naming is annoying:
– “Next gen”, “new gen”, “now gen”
• High Throughput DNA Sequencing
– Helicos, Roche, Illumina, ABI,
Church Lab
– Also other domains: Confocal
microscopes, mass spec, …
“The old days, which were about two
years ago”
Fundamentals
• Standard facilities questions:
– Heat / Power / Floor capacity, just as much as ever
• Network:
– Moving 1TB of data from the instrument to the next
room is hard, much less to a collaborator
– Still with the sneakernet.
• Security
• Data
• Lab information management
Information Management
• Day 0
– Simply catching the data and not dropping it is a
challenge.
• 3 months
– Postdocs carrying data around on firewire disks
– Data management with post-it notes.
• 6 months
– Instrument vendor updates their software.
– Re-analysis?
• 1 year
– New machine from a different vendor.
Networking / Data Motion
• Data motion can interfere with data acquisition
(go on, ask the instrument vendor)
• Software updates can interfere with attempts to
automate data motion
• Move 1TB of data from lab to network closet
– Old building network, instrument offline for hours
– Small $200 4 port gigabit switch (see “security”)
– Excuse for a building-wide upgrade, 1 year horizon
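A back-of-the-envelope check on why an old building network keeps an instrument offline for hours: the Python sketch below estimates wall-clock time to move 1 TB at a given link speed. The `efficiency` parameter is an assumed fudge factor for protocol overhead and contention, not a measured value.

```python
def transfer_hours(size_tb: float, link_mbit_s: float,
                   efficiency: float = 1.0) -> float:
    """Hours to move size_tb (decimal terabytes) over a link of link_mbit_s."""
    if link_mbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and 0 < efficiency <= 1")
    bits = size_tb * 1e12 * 8                        # 1 TB = 10^12 bytes
    seconds = bits / (link_mbit_s * 1e6 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    for label, mbit in (("100 Mbit/s", 100), ("1 Gbit/s", 1000)):
        print(f"1 TB over {label}: {transfer_hours(1, mbit):.1f} h at line rate")
```

At line rate, 1 TB takes roughly 22 hours on 100 Mbit/s and a little over 2 hours on gigabit, which is why the $200 switch is tempting.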
Security
• Network and IT Security: serious job.
• Labs must propose workable solutions, not
wait for security staff to provide them.
• Common observations:
– Mess with building wide network = security audit
– No solution = system offline.
“Stay out of the news”
Humans are the problem
[Diagram: Lab Instruments, Tape Backup, Compute Cluster, One-off Linux / Windows Machine, RAID / Data Store, Dev Workstation]
• Management: Throughput? Status? Schedule? Security?
• Lab Staff: Availability, quality, “did it work?”
• IT Staff: Throughput? Status? Schedule
• Bioinformaticians: Access to data, metadata,
Automation is the solution
[Diagram: Lab Instruments, Tape Backup, Compute Cluster, One-off Linux / Windows Machine, RAID / Data Store, Dev Workstation, LIMS]
• Management: Machines need to write web pages
• IT Staff: Offer levels of service, negotiate directly with scientists
• Bioinformaticians: Access to data, metadata,
WIKILIMS
Wikipedia/ UC Berkeley
Automatic data capture - Ra
Most structured content can be captured and recorded by
programs as it is generated
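As an illustration of capturing structured content by program, the Python sketch below walks a run directory and emits one metadata record per file. It is not WikiLIMS code. The function names, the record fields, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large instrument files never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def capture_run_metadata(run_dir: Path) -> list:
    """Return one metadata record per file under run_dir, sorted by path."""
    records = []
    for path in sorted(p for p in run_dir.rglob("*") if p.is_file()):
        stat = path.stat()
        records.append({
            "path": path.relative_to(run_dir).as_posix(),
            "bytes": stat.st_size,
            "modified": stat.st_mtime,
            "sha256": sha256_of(path),
        })
    return records


if __name__ == "__main__":
    # Demo on a throwaway directory standing in for an instrument run folder.
    with tempfile.TemporaryDirectory() as tmp:
        run = Path(tmp)
        (run / "lane1.txt").write_bytes(b"ACGT")
        print(json.dumps(capture_run_metadata(run), indent=2))
```

Records like these could be posted to a wiki page or database by whatever means the site prefers. The point is that nobody has to type them in.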
File data: RAID
Metadata: wiki
WikiLIMS: Next Gen Data Store
Version Differences
Launch an assembly
Launch an assembly on the cluster
WikiLIMS
• Still sold as a custom service
– No long term license
– Full source code access
– Highly customizable
• Variety of customers:
– Navy Medical Research Labs
– Cold Spring Harbor
– Emory University
– National Cancer Institute
– …
• Booth 113, we’ll talk your ear off.
This is the semantic web
All updates happening at once
STORAGE AND BACKUPS
“On their way to becoming a sick joke”
Storage
• Storage: Same in 2008 as in 2006
– Unhappy technology tradeoffs
– ‘Exotic’ vendors offer blazing speed and a few features
– ‘Mainstream’ vendors exclusively focused on enterprise
– What I need: Massive scaling, decent speed & grab bag of
enterprise features
• Real World Solution, early 2008:
– 100TB disk, backup, small cluster, plus all infrastructure
– Price range: $225k - $998k
Cut the problem into pieces.
Archive vs. Resequence
• 2007
– Shocking suggestion to delete primary data
– Sanger suggested the MAID (Massive Array of Idle Disks)
– Novartis reported that 97% of their files are never
accessed again more than 3 months after generation.
• 2008
– Instrument vendors deleting large volumes of data inside
the box.
– Less shock, more “data lifecycle”
“New instrument data would be different anyway”
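A site can check the "never accessed after 3 months" claim against its own data before agreeing to any deletion policy. The Python sketch below lists files whose last-access time is older than a cutoff. The function name and the 90-day default are assumptions. Access times are only meaningful if the filesystem records them, and mounts using `noatime` or `relatime` will skew the result.

```python
import os
import tempfile
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def stale_files(root, max_age_days=90.0, now=None):
    """Return files under root whose last-access time is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_atime < cutoff
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        old = Path(tmp) / "run_2007.dat"
        old.write_bytes(b"x")
        # Backdate the access time by 100 days to simulate an untouched file.
        long_ago = time.time() - 100 * SECONDS_PER_DAY
        os.utime(old, (long_ago, long_ago))
        print([p.name for p in stale_files(tmp)])
```

The output is a candidate list for a "data lifecycle" conversation, not a deletion script.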
Data Storage
• 1 TB
– $200 @ Savers ($0.20 / GB)
• 24 – 48TB
– Commodity solutions, many vendors ($0.70 /GB)
• 100TB+
– Interesting architectural tradeoffs
– Decision should be based on support expectations
– Below $3 / GB really scares me
• 1 – 2PB
– “Large”
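The per-gigabyte figures on this slide translate into purchase prices as in the Python sketch below. Decimal units (1 TB = 1000 GB) are an assumption, and the helper names are invented for this example.

```python
GB_PER_TB = 1000  # decimal units, the way disk vendors quote capacity


def cost_per_gb(price_usd: float, capacity_tb: float) -> float:
    """Dollars per gigabyte for a system of the given price and capacity."""
    return price_usd / (capacity_tb * GB_PER_TB)


def price_at(capacity_tb: float, usd_per_gb: float) -> float:
    """Purchase price implied by a capacity and a dollars-per-gigabyte rate."""
    return capacity_tb * GB_PER_TB * usd_per_gb


if __name__ == "__main__":
    print(f"1 TB for $200: ${cost_per_gb(200, 1):.2f} / GB")
    print(f"24 TB at $0.70 / GB: ${price_at(24, 0.70):,.0f}")
    print(f"48 TB at $0.70 / GB: ${price_at(48, 0.70):,.0f}")
    print(f"100 TB at $3 / GB: ${price_at(100, 3):,.0f}")
```

By this arithmetic the "scary" threshold for 100 TB sits at about $300,000, against $16,800 to $33,600 for the commodity tier.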
• 100+TB SAN
• 50+ compute nodes
• One rather warm closet.
Backups are legion
• Archive:
– 1TB Firewire disk - $150
– 800GB LTO4 Tape - $90 (plus a sizable machine)
• Disaster Recovery:
– Failover, redundancy, etc.
– Just buy two of everything.
• Incremental Rollback
– Traditional “backups”
– Daily, weekly, differentials
Talk to Finance people about backups.
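For the conversation with Finance, the archive media prices above can be compared as in the Python sketch below. The 100 TB archive size is an illustrative assumption. The tape figure covers cartridges only, since the slide gives no price for the "sizable machine" that reads them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Medium:
    name: str
    capacity_gb: int
    unit_price_usd: float


# Prices and capacities as quoted on the slide.
FIREWIRE_DISK = Medium("1TB Firewire disk", 1000, 150.0)
LTO4_TAPE = Medium("800GB LTO4 tape", 800, 90.0)


def units_needed(archive_gb: int, medium: Medium) -> int:
    """Whole disks or cartridges needed to hold archive_gb (ceiling division)."""
    return -(-archive_gb // medium.capacity_gb)


def media_cost(archive_gb: int, medium: Medium) -> float:
    """Media-only cost; excludes drives, robots, and staff time."""
    return units_needed(archive_gb, medium) * medium.unit_price_usd


if __name__ == "__main__":
    archive_gb = 100_000  # hypothetical 100 TB archive
    for medium in (FIREWIRE_DISK, LTO4_TAPE):
        count = units_needed(archive_gb, medium)
        cost = media_cost(archive_gb, medium)
        print(f"{medium.name}: {count} units, ${cost:,.0f}")
```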
Legacy Storage Architecture
[Diagram: Data Ingest (instruments), 4PB Tape Archive, 24TB “hot” disk for analysis, SGI SMP Machines, Linux Cluster, Workstations, Web / FTP access; annotated “Caching problem”]
New Realities Allow Simplification
[Diagram: Data Ingest (instruments), 4PB Tape Backup, 1PB “hot” disk for analysis, SGI SMP Machines, Linux Cluster, Workstations, Web / FTP access]
NEW, COOL STUFF
Amazon Web Services
• EC2 for virtualized computing
– The economics are compelling
• One month of serious experimentation:
– $9.00 USD billed to credit card
– Various money making approaches
• Flexible pricing allows reselling & revenue sharing
• Create an EC2 image and add my own fees on top to cover
development and support costs
– As a developer, I don’t need your credit card
• Amazon handles all transactions & billing
Bioteam and Amazon EC2
• This is the grid:
– Every Bioteam consultant independently deployed an EC2 solution in
2008.
• Inquiry
– Since 2004 - “bioinformatics on a cluster”
– Apple, Microsoft CCS, Linux, etc.
– May 1, 2008: Inquiry on Amazon EC2
– CPU Cost to customer: $10 / node day
• Data service: 500GB, constantly updated:
– $1400 / yr: downloads, maintenance, and storage
– $17 / yr: cost to Bioteam to support a customer
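The $10 per node-day price invites a comparison with the $3k server quoted earlier in the talk. The Python sketch below does that arithmetic. The function names are made up for this example, and the comparison assumes hardware purchase price only, leaving out power, cooling, space, and administration.

```python
def rental_cost(nodes: int, days: float, usd_per_node_day: float = 10.0) -> float:
    """Cost of renting `nodes` nodes for `days` days at a flat node-day rate."""
    return nodes * days * usd_per_node_day


def breakeven_node_days(server_price_usd: float = 3000.0,
                        usd_per_node_day: float = 10.0) -> float:
    """Node-days of rental that cost as much as buying one server outright."""
    return server_price_usd / usd_per_node_day


if __name__ == "__main__":
    print(f"20 nodes for 3 days: ${rental_cost(20, 3):,.0f}")
    print(f"Break-even against a $3k server: {breakeven_node_days():.0f} node-days")
```

Under these assumptions a lab with bursty, occasional workloads stays below the 300 node-day break-even for a long time, while a machine that is busy every day does not.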
Conclusion
If scientists are wasting a bunch of time on IT,
we’ve got more work to do.
Disturbing Observation
I seem to have presented both a functional
“grid” and an instance of a “semantic web” in
the same talk.
Thank You
• Cambridge Healthtech Institute
– Cindy Crowninshield, Kevin Davies
• Bioteam Customers
– Ed Delong (MIT), Tim Read (NMRC), Yuri Kotliari (NIH),
CSHL
• Bioteam
– Mike Cariaso, Chris Dagdigian, Stan Gloss, Brian
Osborne, Bill Van Etten, Jiesheng Zhang
• Community
– Bioclusters, Sun Grid Engine, Bioinformatics.org
Questions
