Scientific
Computing at
Fred Hutch
AIRI IT 2018
Slides: Updated April 30th, 2018
2
About Fred Hutch
 Cancer & HIV Research
 3200 Staff in Seattle
 240 Faculty
 $500M Budget (71%
Grants/Contracts)
 5 Scientific Divisions
 1.5M Sqft buildings
3
Compute Infrastructure / HPC
4
HPC 2016/2017: the need for cloud bursting?
• Git(hub): Manage
code and config
• Containers:
Encapsulate and
version software
• Object Storage:
Cheap, resilient,
scalable, like S3
• Cloud APIs:
Secret Sauce, but
works
… or cloud-native computing?
AWS Batch: container-based computing from GitHub
Sample 1 -> Genome assembly 1
Sample 2 -> Genome assembly 2
Sample 3 -> Genome assembly 3
…
Sample 378 -> Genome assembly 378
Step 1: De novo genome assembly
Step 2:
Deduplicate
genome
assemblies
Sample 1 -> Gene abundances 1
Sample 2 -> Gene abundances 2
Sample 3 -> Gene abundances 3
…
Sample 378 -> Gene abundances 378
Step 3: Quantify microbial genes
Database
AWS Batch Task
NCBI SRA Database
AWS S3 – FASTQ Cache
Extract FASTQ Assemble Genome
AWS S3 – Genome Storage
Pool Genomes
AWS S3 – Database Storage
Quantify Genes
AWS S3 – Final Results
Identify Microbial Genes
Identify microbial (bacterial, viral, etc.) genes which are expressed in the gut of people with inflammatory bowel disease
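The per-sample fan-out above (one Batch task per sample, 378 samples) maps naturally onto an AWS Batch array job. A minimal boto3 sketch; the queue name, job definition, and `MANIFEST` variable are hypothetical, not the names the actual pipeline uses:

```python
def build_assembly_job(n_samples, manifest_s3_uri):
    """Build kwargs for batch.submit_job(): one array job fans out into
    n_samples child jobs, and each child reads its sample index from the
    AWS_BATCH_JOB_ARRAY_INDEX environment variable Batch injects."""
    return {
        "jobName": "denovo-assembly",
        "jobQueue": "genomics-spot-queue",       # hypothetical queue name
        "jobDefinition": "assembler:1",          # hypothetical job definition
        "arrayProperties": {"size": n_samples},  # one child job per sample
        "containerOverrides": {
            "environment": [{"name": "MANIFEST", "value": manifest_s3_uri}]
        },
    }

args = build_assembly_job(378, "s3://bucket/samples.csv")
# import boto3
# boto3.client("batch").submit_job(**args)  # requires AWS credentials
```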
8
HPC with AWS Batch
Opportunities
 Multitenancy - Not yet designed for
many different users launching jobs
in a single AWS account
 No accounting
 Custom tools needed to be written:
 a wrapper to mitigate
accounting issues
 Tool to facilitate use of named
pipes for streaming
 Store Batch events in database
(otherwise they disappear after
24h)
 A dashboard
Successes
 Great for scaling jobs that use
docker containers and can make
use of S3
 Successful projects :
 Multi-step array job (picard,
kallisto, pizzly)
 Microbiome pipeline
9
End users don’t have AWS console access, so we built
a custom batch console…
Azure Batch
And doAzureParallel:
Create an R
supercomputer on
your Laptop
https://github.com/Azure/doAzureParallel
11
Globus & S3
 SaaS solution
 Tag filtering
 Integrated
Workflows
 S3 credentials need
to be kept
server-side
 SSO using
Okta
12
HPC – Native, Hybrid and Multi-Cloud
 First, try AWS Batch with containers and object store
 If that does not work for everyone, go back to traditional HPC workloads
Opportunities
 Select workloads based on “mountstats”; low-IO workloads use traditional HPC
 Did we select a stack that can work in multiple clouds?
AWS Batch -
Containers & S3
Traditional HPC
in Hybrid Cloud
14
Traditional HPC in Cloud
Opportunities
 Perpetuates legacy workflows &
postpones need for change.
 Layer 2 VPNs for high IO?
 Slurm 17.11 cluster federation will
increase resource utilization
 Spot market and scheduler efficiency
 Path to HIPAA readiness
Successes
 “sbatch -M beagle”: extremely
simple to “be in the cloud”
 VPN with custom Fortinet kernel:
150 MB/s NFS to on-prem vs. 50 MB/s
 Consistent workflows on-prem and
multi-cloud with data & code
access.
 Manual cloud bursting
 Using the Slurm power saving API to
automatically shut down idle nodes
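The power-saving hook above boils down to a SuspendProgram: Slurm hands it the idle nodes as a hostlist expression (e.g. "beagle[01-10]") and it stops the matching cloud instances. A minimal sketch; the single-range parser and the `stop_instance` callback are simplifications (real deployments can shell out to `scontrol show hostnames` and the cloud SDK):

```python
import re

def expand_hostlist(expr):
    """Expand a simple Slurm hostlist like 'beagle[01-03]' into node names.
    Handles only the single-range form; preserves zero-padding width."""
    m = re.fullmatch(r"(\w+)\[(\d+)-(\d+)\]", expr)
    if not m:
        return [expr]  # plain hostname, no range
    prefix, lo, hi = m.group(1), m.group(2), m.group(3)
    width = len(lo)
    return [f"{prefix}{str(i).zfill(width)}" for i in range(int(lo), int(hi) + 1)]

def suspend(expr, stop_instance):
    """SuspendProgram body: Slurm passes the idle node list as argv[1];
    stop_instance(node) would call the cloud API (e.g. EC2 StopInstances)."""
    for node in expand_hostlist(expr):
        stop_instance(node)
```

A matching ResumeProgram does the inverse (start instances, wait for them to rejoin the cluster).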
15
2 Projects published on pypi.org
 Ongoing support from Fred Hutch staff
 In production
Slurm Limiter
 Dynamically adjust account limits
 Increase responsiveness and utilization
Ldap Caching
 Fast LDAP / Idmap
 Replicates AD, replaces Centrify / SSSD
16
Scientific Pipelines
 We have many, and too many are homegrown (shell scripts, make, scons)
 They lack cloud compatibility, error checking, etc.
 Must pipelines be written in a language people know? If yes: Python
 Does it need to be CWL-compatible? http://www.commonwl.org/ says:
 Tools to be tested at FredHutch: Luigi/SciLuigi, Airflow, Snakemake
 Are tools originating in research outfits sustainable? Toil, Cromwell?
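What the homegrown shell-script pipelines lack, and what tools like Luigi, Snakemake and Airflow provide, is dependency tracking: steps form a DAG, each step runs only after its inputs, and completed work is skipped on re-run. A framework-neutral sketch of that idea (deliberately not any specific tool's API):

```python
def run_pipeline(tasks, done=None):
    """Minimal DAG runner: tasks maps name -> (deps, action). Each action
    runs after its dependencies, and names already in `done` are skipped
    on re-run -- the two features ad-hoc shell pipelines usually lack."""
    done = set() if done is None else done
    order = []

    def visit(name):
        if name in done:
            return
        for dep in tasks[name][0]:
            visit(dep)
        tasks[name][1]()
        done.add(name)
        order.append(name)

    for name in tasks:
        visit(name)
    return order

# Toy pipeline mirroring the microbiome workflow's three steps.
log = []
tasks = {
    "assemble": ([],           lambda: log.append("assemble")),
    "dedup":    (["assemble"], lambda: log.append("dedup")),
    "quantify": (["dedup"],    lambda: log.append("quantify")),
}
run_pipeline(tasks)
```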
17
HPC – a word about GPUs and machine learning
 Gaming GPUs have a 10x price/performance advantage for TensorFlow
https://github.com/tobigithub/tensorflow-deep-learning/wiki/tf-benchmarks
 But Nvidia does not want you to use them:
https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus/
 Will we see homegrown GeForce racks in lab spaces?
18
HPC Future – Combining Kubernetes and Slurm ?
 Containers on bare metal: virtualization without performance penalty
 Run Docker based and HPC workflows on same infrastructure
Risks
 Nobody has really done it
Opportunities
 Dynamically share infrastructure
 Compatibility with Cloud based container
services
 No need to use bridge-tech (Singularity)
(Diagram: nodes 1–999 running Docker under Kubernetes alongside LXC/LXD under Slurm on the same hardware.)
19
Apps, Databases, Devops
20
Collaborate on building faster &
reproducible Scientific Software
Why? Run rbench using R-3.3.0:
 179 secs R compiled on Ubuntu 14.04
 83 secs (54% faster) EasyBuild foss-2016a R
 91 secs (49% faster) EasyBuild intel-2016a R
 86 secs (52% faster) Microsoft R (yes, on Linux)
21
So, how does it work?
The sys admin clones a git repo and builds software:
> eb R-3.4.3-foss-2016b-fh2.eb --robot
The user runs:
> ml R/3.4.3-foss-2016b-fh2
> which R
/app/easybuild/software/R/3.4.3-foss-2016b-fh2/bin/R
> R --version
R version 3.4.3 (2017-11-30) -- "Kite-Eating Tree"
22
Shouldn’t you be using Docker, doh?
Of course you can do this in your Dockerfile:
RUN apt-get update && apt-get -y install r-base
RUN Rscript -e "install.packages('yhatr')"
RUN Rscript -e "install.packages('ggplot2')"
RUN Rscript -e "install.packages('plyr')"
RUN Rscript -e "install.packages('reshape2')"
And then publish the docker container for reproducibility
23
What if they want to use a specific
version of R, or a specific version of an R package? Compile? Black box?
24
What do others say?
James Cuff *):
 “So what about back to rolling
your own? … To be clear this is
R on top of Python on top of
TensorFlow. It’s a deep stack.”
 On AI tools: “Significantly more
research will be needed …
techniques aren’t documented
clearly with example code that
is also fully reproducible.”
*) former Assistant Dean and Distinguished Engineer for Research Computing at Harvard
https://www.nextplatform.com/2018/04/09/deep-learning-in-r-documentation-drives-algorithms/
https://www.nextplatform.com/2018/04/18/containing-the-complexity-of-the-long-tail/
25
EB = reducing the chaos
Recipes are Python code (e.g. lists) with strict
versioning for libs and packages for reproducibility
https://github.com/FredHutch/easybuild-life-sciences/tree/master/scripts
use easy_update.py to update R/Python packages to the latest and then freeze
or https://github.com/FredHutch/easybuild-life-sciences/
26
EB @ Fred Hutch
We want to :
 Build on multiple OS versions
(Ubuntu 14.04/16.04/18.04)
 Use the same process for building
in cloud native and in traditional
environments
 Use Docker / Singularity as well as
the /app folder
We can share:
https://github.com/FredHutch/ls2
27
LS2 on Docker Hub
Giant R container: https://hub.docker.com/r/fredhutch/ls2_r/
Python container: https://hub.docker.com/r/fredhutch/ls2_python
28
EB @ Fred Hutch
Other tools, sometimes complementary:
See https://fosdem.org/2018/schedule/event/installing_software_for_scientists/
29
DB4Sci – lightweight DBaaS
Motivation
 Users requested too many
different database setups
(databases, instances,
versions, performance, etc.)
 Enterprise setup with iSCSI
storage had performance
issues with HPC
 Better backup to cloud
needed
30
DB4Sci – lightweight DBaaS
Architecture
 “Difference of opinion”:
can databases run in
containers?
 Stack: NVMe, ZFS, Docker,
Flask, AD
 Install from GitHub
31
DB4Sci – lightweight DBaaS
Features
 Encryption at Rest
MySQL & Postgres
 Cloud Backup
 AD Integration
 Bulletproof under HPC load
 In Production
 Next: Self service
restore into new
DB
Download: http://db4sci.org
Get involved: https://github.com/FredHutch/DB4SCI
32
GPU databases – performance at a new level
Technology                          Runtime (sec)   Factor slower
MapD & 8 x Nvidia Pascal Titan X    0.109           1
AWS Redshift on 6 x ds2.8xlarge     1.905           17
Google BigQuery                     2               18
Postgres & 1 x 4 core, 16GB, SSD    205             1,881
http://tech.marksblogg.com
 Runs the NY Taxi dataset on many technologies
 Benchmarks are documented and fully reproducible
Leaders
 MapD.com
 Brytlyt.com
 Continuum Analytics
(GPU data frame)
33
Metrics and Monitoring with
Prometheus and Grafana
34
Current Metrics & Monitoring Solutions
Custom in-house tools: TelemetryTOC, Nodewatch
Prometheus + Grafana
35
What is Prometheus?
 Prometheus is a popular open-source time series database, monitoring and
alerting system from the Cloud Native Computing Foundation (CNCF).
 Very rich ecosystem of add-ons, exporters and SDKs.
 Projects adopting “Cloud Native” approach have built-in support to expose
metrics to Prometheus.
 Uses a “Pull” model by default but a “Push” gateway is available when
required.
 Very high performance (v2.0+); a single powerful server can ingest up to
800,000 samples per second.
 Multi-dimensional data model with time series data identified by metric name
and key/value pairs.
 Flexible query language (PromQL)
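The pull model and the multi-dimensional data model come down to a /metrics endpoint serving plain text that Prometheus scrapes: metric name plus key/value labels. A stdlib-only sketch of that exposition format (in practice the official prometheus_client library renders and serves this for you):

```python
def render_metric(name, value, labels=None):
    """Render one sample in the Prometheus text exposition format --
    exactly what a /metrics endpoint serves and Prometheus pulls.
    The multi-dimensional model is just name + sorted key/value labels."""
    if labels:
        inner = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{inner}}} {value}"
    return f"{name} {value}"

print(render_metric("node_filesystem_avail_bytes", 12884901888,
                    {"device": "/dev/sda1", "mountpoint": "/"}))
# node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 12884901888
```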
36
Prometheus Solution Architecture (diagram):
 Prometheus pulls from node_exporter / wmi_exporter workstation agents, from the SNMP Exporter (network and storage devices), and probes via the BlackBox Exporter (HTTP, HTTPS, DNS, TCP and ICMP external checks).
 Scripts, job results and custom metrics are posted to the PushGateway, which Prometheus also pulls.
 Prometheus writes alerts to the Alert Manager; Grafana queries Prometheus with PromQL to render dashboards.
37
Self-Service Metrics Gathering and Dashboards (diagram):
1. Install exporters on systems
2. Commit a new or updated config file (*.yml) to GitHub
3. Prometheus incorporates the new targets
4. Prometheus collects metrics from the systems
5. Grafana pulls metrics from Prometheus
6. Authorized users (Grafana_Editors group; authentication & authorization via Active Directory) log in and create dashboards, panels or alerts
7. View-only users see the results
Example target file:
- targets:
  - srv1.fhcrc.org:9100
  - srv2.fhcrc.org:9100
  - srv3.fhcrc.org:9182
  labels:
    app: abc
    owner: mydept
38
SAS Metering Agent
https://github.com/FredHutch/sas-metering-client
 Single binary with no dependencies
 Runs as a Windows service
 Every minute checks to see if SAS Desktop is running
 POSTs results (0 or 1) to the Prometheus push gateway
 Prometheus push gateway is open to the Internet so workstations can report in from
anywhere
 Deployed to SAS workstations (135) via SCCM
 Light resource footprint:
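The agent's once-a-minute report amounts to POSTing one sample in Prometheus text format to the push gateway's job/instance path. A stdlib-only sketch; the metric name `sas_desktop_running` and the gateway URL are hypothetical, and the real agent is a single compiled Windows binary:

```python
from urllib.parse import quote

def build_push(gateway, job, instance, running):
    """Build the Pushgateway request the metering agent would send each
    minute: POST text-format metrics to /metrics/job/<job>/instance/<inst>.
    `running` is the 0-or-1 SAS Desktop check result."""
    url = f"{gateway}/metrics/job/{quote(job)}/instance/{quote(instance)}"
    body = f"sas_desktop_running {1 if running else 0}\n"
    return url, body

url, body = build_push("https://push.example.org", "sas_metering", "ws042", True)
# import urllib.request
# urllib.request.urlopen(urllib.request.Request(url, body.encode(), method="POST"))
```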
39
SAS Desktop Metering Dashboard
40
Mirroring 1.4PB of data to S3 with Rsync and ObjectiveFS
Initial mirror took 60 days, but was throttled to prevent overwhelming our Firewall.
(Timeline milestones: started mirroring → first-pass mirroring complete → S3 Standard to IA migration.)
41
On-Prem vs Cloud HPC Metrics – Core and Node Utilization
42
HPC Service SLA – We don’t always meet our CPU core SLA
43
HPC – CPU Cores Per Lab (Gizmo + Beagle Clusters Combined)
44
EC2 Instance Types – Using 32 Different Instance Types!
Transitioning
from C4 to C5
Instances
45
Keeping an eye on shared (abused) interactive nodes
(Dashboard panels: Who’s hogging the CPU? Who’s hogging the RAM? System out of RAM?)
Interactive Nodes
 3 systems
 56 CPU cores
 376GB RAM
46
Automated Pipelines
 Scientific Workflows
 Application Deployment
47
Workflow
 User prepares data and job configuration
locally
 Uses command line “mqsubmit” tool to
submit jobs to the pipeline
 When complete, the user receives an
email with a link to download the results
Notes
 Proteomics mostly uses Windows platforms,
which are not easy to integrate in a Linux
shop; going through a cloud API makes it better.
 Custom pipeline runs Windows jobs from a familiar
command-line interface on Linux
MaxQuant Proteomics Pipeline
$ mqsubmit --mqconfig mqpar.xml --jobname job01 --department scicomp --email me@fhcrc.org
Example Job Submission:
https://github.com/FredHutch/maxquant-pipeline
48
MaxQuant Proteomics Pipeline
“I am again extremely impressed at the speed of your cloud setup! Less than 12 hours to
complete the search, and it would have easily been more than 4 days with our old setup!”
-- MaxQuant Pipeline User
91% Utilization of
128 CPU cores
49
Shiny Web Application Pipelines
(CI/CD diagram: a researcher commits the Shiny app to Git and pushes to GitHub; the push triggers CircleCI to test and build; the image is published to Docker Hub; Rancher, the container orchestrator, pulls and deploys it; users access the running Shiny app.)
Good
 After initial setup, researchers can update and re-deploy applications themselves.
 They only need to contact us if the pipeline breaks.
Bad
 Building R packages and dependencies can take a long time to compile; as long
as 40 minutes in some cases.
 Solution: use a base image with all or most of the dependencies cooked in.
50
Statistics
 420 Repositories (268 private, 152 public)
 148 Users
 81 Teams
https://github.com/FredHutch
Git & GitHub skills are essential for developers, IT
Professionals and even Researchers now.
51
Data Management & Storage
52
53
Archiving?
 Is there a demand?
 Are users collecting
useful metadata?
 How many PB do
you need before you
should be concerned?
Ingredients (also used for storage chargebacks)
 pwalk - https://github.com/fizwit/filesystem-reporting-tools
a multi-threaded file system metadata collector
 slurm-pwalk - https://github.com/FredHutch/slurm-pwalk
parallel pwalk crawler execution and import into PostgreSQL
 storage-hotspots - https://github.com/FredHutch/storage-hotspots
a Flask app that helps find large folders in the database and
triggers an archiving workflow
 DB4SCI - http://db4sci.org (optional)
high-performance database platform
OR better, just use
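The pwalk step can be sketched in a few lines of Python: walk the tree, stat every file, and yield rows ready for bulk import into PostgreSQL. Single-threaded here; pwalk itself fans out across threads per directory:

```python
import os

def walk_metadata(top):
    """Collect per-file metadata the way pwalk does -- (path, size,
    mtime, uid) tuples -- ready for a bulk COPY into PostgreSQL.
    Sketch only: no error handling for unreadable files, single thread."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            yield (path, st.st_size, int(st.st_mtime), st.st_uid)
```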
55
Data Storage News – Backup / DR in cloud
 ObjectiveFS Backup and DR moved into production in Oct 2017, moving 1 PB/month
 POSIX FS & Linux scale-out NAS on S3 (think cloud Gluster)
 Parallel rsync: ls -a1 | tail -n +3 | parallel -u --progress rsync -avzP {} /target
Cautions
 Staff needs familiarity with rsync and
monitoring / logging
 Avoid more than 500 TB / folder
 Only 90% in S3 IA, 10% in S3 Standard
 Limit your retention period for backups
Opportunities
 One solution for Backup / DR
 $8 per TB/month at 50% compression
 300+ MB/s per node
 Re-using the backup / DR copy for fast
“RO” data access
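The parallel rsync one-liner works by splitting the top-level entries of the source tree across concurrent transfers (the `tail -n +3` just drops `.` and `..`). The same pattern in Python, with `copy_entry` standing in for the rsync subprocess:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def mirror_parallel(src, target, copy_entry, workers=8):
    """Same idea as `ls -a1 | tail -n +3 | parallel rsync {} /target`:
    fan the top-level entries of src out across parallel workers, one
    transfer each. copy_entry(src_path, dst_path) performs the copy
    (an rsync subprocess, shutil.copy2, ...)."""
    entries = sorted(os.listdir(src))  # listdir already omits '.' and '..'
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(copy_entry,
                               os.path.join(src, e),
                               os.path.join(target, e))
                   for e in entries]
        for f in futures:
            f.result()  # re-raise any transfer error
```

Throttling (as done here to protect the firewall) is then just a matter of lowering `workers` or rate-limiting inside `copy_entry`.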
56
Data Storage Landscape at Fred Hutch today
57
Systems are presented uniformly via DFS/AutoFS
SMB / CIFS Namespace
NFS / POSIX Namespace
Does not work well with POSIX symlinks – Windows wonders: where is /folder?
58
We need a Samba server to make all symlinks work
59
Well, if we don’t even fully use the NAS … can we go all the way and use OSS cheaply?
60
A word about BeeGFS
 In Production as Scratch FS since 2014
 100% uptime (with some cheating)
 Currently ca. 400 TB capacity
 1000 small files/sec (NetApp: 800,
Isilon X: 280, Avere: 300)
 Infinitely scalable through distributed
metadata infrastructure
 Open source, HA and change logs
Risks
 Config defaults to XFS
instead of ZFS; ZFS is
less widely tested
 No vendor phone
home system
Links
 Configuration published
https://github.com/FredHutch/chromium-zfs-beegfs
 scratch-dna benchmark in Python, C and GO:
https://github.com/FredHutch/sc-benchmark
 A Samba server joined to AD, in production:
https://github.com/FredHutch/sc-howto/
61
StorOne
 Local SSD/Disk used by
KVM VM
 High performance
 Made by storage industry
veterans
NyRiad
 Uses GPU to create large
erasure coded pool
 Linux Kernel Module creates
one large block device
 Built for the largest radio telescope
Fun question: is RAID still possible with 14 TB drives?
 And what if you think the answer is no?
62
Thank you !
Dirk Petersen petersen at fredhutch.org
Robert McDermott rmcdermo at fredhutch.org
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 

Último (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

Scientific Computing @ Fred Hutch

  • 1. Scientific Computing at Fred Hutch AIRI IT 2018 Slides: Updated April 30th, 2018
  • 2. 2 About Fred Hutch  Cancer & HIV research  3,200 staff in Seattle  240 faculty  $500M budget (71% grants/contracts)  5 scientific divisions  1.5M sq ft of buildings
  • 4. 4 HPC 2016/2017: the need for cloud bursting?
  • 5. • Git(hub): manage code and config • Containers: encapsulate and version software • Object storage: cheap, resilient, scalable, like S3 • Cloud APIs: secret sauce, but it works … or cloud-native computing?
  • 6. AWS Batch: container-based computing from GitHub
  • 7. Identify microbial (bacterial, viral, etc.) genes which are expressed in the gut of people with inflammatory bowel disease.
    Step 1: De novo genome assembly (Sample 1 -> Genome assembly 1, Sample 2 -> Genome assembly 2, … Sample 378 -> Genome assembly 378)
    Step 2: Deduplicate genome assemblies into a pooled database
    Step 3: Quantify microbial genes (Sample 1 -> Gene abundances 1, Sample 2 -> Gene abundances 2, … Sample 378 -> Gene abundances 378)
    AWS Batch task flow: NCBI SRA Database -> Extract FASTQ -> AWS S3 (FASTQ cache) -> Assemble genome -> AWS S3 (genome storage) -> Pool genomes -> AWS S3 (database storage) -> Quantify genes -> AWS S3 (final results)
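The workflow above is a classic fan-out/fan-in pattern: one assembly job per sample, a single pooled deduplication step, then one quantification job per sample. A minimal sketch of that dependency graph (task names and sample labels here are illustrative, not the actual Batch job definitions):

```python
# Hypothetical sketch of the fan-out/fan-in shape of the microbiome pipeline:
# per-sample assembly tasks, one deduplication task that depends on all of
# them, then per-sample quantification tasks that depend on the dedup step.

def build_pipeline(samples):
    """Return a mapping of task name -> list of prerequisite task names."""
    deps = {}
    for s in samples:
        deps[f"assemble:{s}"] = []                       # Step 1: fan-out
    deps["dedup"] = [f"assemble:{s}" for s in samples]   # Step 2: fan-in
    for s in samples:
        deps[f"quantify:{s}"] = ["dedup"]                # Step 3: fan-out
    return deps

if __name__ == "__main__":
    dag = build_pipeline([f"sample{i}" for i in range(1, 4)])
    print(len(dag))          # 3 assemblies + 1 dedup + 3 quantifications
    print(dag["dedup"])
```

In AWS Batch the two fan-out steps map naturally onto array jobs, with the dedup step submitted as a job that depends on the whole assembly array.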
  • 8. 8 HPC with AWS Batch Successes  Great for scaling jobs that use Docker containers and can make use of S3  Successful projects: a multi-step array job (picard, kallisto, pizzly) and the microbiome pipeline Opportunities  Multitenancy: not yet designed for many different users launching jobs in a single AWS account  No accounting  Custom tools had to be written: a wrapper to mitigate the accounting issues, a tool to facilitate the use of named pipes for streaming, a service to store Batch events in a database (otherwise they disappear after 24h), and a dashboard
  • 9. 9 End users don’t have AWS console access, so we built a custom batch console…
  • 10. Azure Batch And doAzureParallel: Create an R supercomputer on your Laptop https://github.com/Azure/doAzureParallel
  • 11. 11 Globus & S3  SaaS solution  Tag filtering  Integrated workflows  S3 credentials need to be kept server-side  SSO using Okta
  • 12. 12 HPC – native, hybrid and multi-cloud  First try AWS Batch with containers and object store  If that does not work for everyone, fall back to traditional HPC workloads Opportunities  Select workloads based on “mountstats”; low-IO workloads stay on traditional HPC  Did we select a stack that can work in multiple clouds? AWS Batch - Containers & S3 Traditional HPC in Hybrid Cloud
  • 13.
  • 14. 14 Traditional HPC in Cloud Successes  “sbatch -M beagle”: extremely simple to “be in the cloud”  VPN with custom Fortinet kernel: 150 MB/s NFS to on-prem vs. 50 MB/s before  Consistent workflows on-prem and multi-cloud, with data & code access  Manual cloud bursting  Using the Slurm power-saving API to automatically shut down idle nodes Opportunities  Perpetuates legacy workflows & postpones the need for change  Layer 2 VPNs for high IO?  Slurm 17.11 cluster federation will increase resource utilization  Spot market and scheduler efficiency  Path to HIPAA readiness
  • 15. 15 2 projects published on pypi.org  Ongoing support from Fred Hutch staff  In production Slurm Limiter  Dynamically adjusts account limits  Increases responsiveness and utilization Ldap Caching  Fast LDAP / idmap  Replicates AD, replaces Centrify / SSSD
  • 16. 16 Scientific Pipelines  We have many, and too many are homegrown (shell scripts, make, scons)  Lack of cloud compatibility, error checking, etc.  Must pipelines be written in a language people know? If yes: Python  Do they need to be CWL compatible? http://www.commonwl.org/ says:  Tools to be tested at Fred Hutch: Luigi/SciLuigi, Airflow, Snakemake  Are tools that originated in research outfits sustainable? Toil, Cromwell?
  • 17. 17 HPC – a word about GPUs and machine learning  Gaming GPUs have a 10x price/performance advantage for TensorFlow https://github.com/tobigithub/tensorflow-deep-learning/wiki/tf-benchmarks  But Nvidia does not want you to use them: https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus/  Will we see homegrown GeForce racks in lab spaces?
  • 18. 18 HPC Future – Combining Kubernetes and Slurm?  Containers on bare metal, virtualization without performance penalty  Run Docker-based and HPC workflows on the same infrastructure Risks  Nobody has really done it Opportunities  Dynamically share infrastructure  Compatibility with cloud-based container services  No need to use bridge tech (Singularity) (Diagram: nodes 1–999 shared between Docker/Kubernetes and LXC/LXD/Slurm)
  • 20. 20 Collaborate on building faster & reproducible scientific software Why? Running rbench with R 3.3.0:  179 secs: R compiled on Ubuntu 14.04  83 secs (54% faster): EasyBuild foss-2016a R  91 secs (49% faster): EasyBuild intel-2016a R  86 secs (52% faster): Microsoft R (yes, on Linux)
  • 21. 21 So, how does it work? The sysadmin clones a Git repo and builds software: > eb R-3.4.3-foss-2016b-fh2.eb --robot The user runs: > ml R/3.4.3-foss-2016b-fh2 > which R /app/easybuild/software/R/3.4.3-foss-2016b-fh2/bin/R > R --version R version 3.4.3 (2017-11-30) -- "Kite-Eating Tree"
  • 22. 22 Shouldn’t you be using Docker, duh? Of course you can do this in your Dockerfile: RUN apt-get -y install r-base RUN Rscript -e "install.packages('yhatr')" RUN Rscript -e "install.packages('ggplot2')" RUN Rscript -e "install.packages('plyr')" RUN Rscript -e "install.packages('reshape2')" And then publish the Docker container for reproducibility
  • 23. 23 But what if they want to use a specific version of R, or a specific version of an R package? Compile? Black box?
  • 24. 24 What do others say? James Cuff *):  “So what about back to rolling your own? … To be clear this is R on top of Python on top of TensorFlow. It’s a deep stack.”  AI tools: “Significantly more research will be needed … techniques aren’t documented clearly with example code that is also fully reproducible.” *) former Assistant Dean and Distinguished Engineer for Research Computing at Harvard https://www.nextplatform.com/2018/04/09/deep-learning-in-r-documentation-drives-algorithms/ https://www.nextplatform.com/2018/04/18/containing-the-complexity-of-the-long-tail/
  • 25. 25 EB = reducing the chaos Recipes are Python code (e.g. lists) with strict versioning for libs and packages for reproducibility https://github.com/FredHutch/easybuild-life-sciences/tree/master/scripts use easy_update.py to update R/Python packages to the latest and then freeze or https://github.com/FredHutch/easybuild-life-sciences/
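EasyBuild recipes being Python code means version pins are just data that can be checked programmatically. A hypothetical fragment in the style of an easyconfig extension list (package names and versions are examples, not the actual Fred Hutch recipes), with a small check that everything is frozen:

```python
# Illustrative fragment in the style of an EasyBuild easyconfig: every
# extension is pinned to an exact version so a rebuild is reproducible.
# Package names/versions below are examples only.
exts_list = [
    ('ggplot2', '2.2.1'),
    ('plyr', '1.8.4'),
    ('reshape2', '1.4.3'),
]

def frozen(exts):
    """True only if every extension carries an explicit version pin."""
    return all(len(e) >= 2 and e[1] for e in exts)

if __name__ == "__main__":
    print(frozen(exts_list))
```

This is the kind of invariant easy_update.py preserves: it bumps the pinned versions to the latest releases, after which the list is frozen again.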
  • 26. 26 EB @ Fred Hutch We want to:  Build on multiple OS versions (Ubuntu 14.04/16.04/18.04)  Use the same process for building in cloud-native and in traditional environments  Use Docker / Singularity as well as the /app folder We can share: https://github.com/FredHutch/ls2
  • 27. 27 LS2 on Docker Hub Giant R container: https://hub.docker.com/r/fredhutch/ls2_r/ Python container: https://hub.docker.com/r/fredhutch/ls2_python
  • 28. 28 EB @ Fred Hutch Other tools, sometimes complementary: see https://fosdem.org/2018/schedule/event/installing_software_for_scientists/
  • 29. 29 DB4Sci – lightweight DBaaS Motivation  Users requested too many different database setups (databases, instances, versions, performance, etc.)  Enterprise setup with iSCSI storage had performance issues with HPC  Better backup to the cloud needed
  • 30. 30 DB4Sci – lightweight DBaaS Architecture  “Difference of opinion”: can databases run in containers?  Stack: NVMe, ZFS, Docker, Flask, AD  Install from GitHub
  • 31. 31 DB4Sci – lightweight DBaaS Features  Encryption at rest for MySQL & Postgres  Cloud backup  AD integration  HPC bullet-proof  In production  Next: self-service restore into a new DB Download: http://db4sci.org Get involved: https://github.com/FredHutch/DB4SCI
  • 32. 32 GPU databases – performance at a new level http://tech.marksblogg.com  Runs the NY Taxi dataset on many technologies  Benchmarks documented, fully reproducible
    technology                         | runtime (sec) | factor slower
    MapD & 8 x Nvidia Pascal Titan X   | 0.109         | 1
    AWS Redshift on 6 x ds2.8xlarge    | 1.905         | 17
    Google BigQuery                    | 2             | 18
    Postgres & 1 x 4 core, 16GB, SSD   | 205           | 1881
    Leaders:  MapD.com  Brytlyt.com  Continuum Analytics (GPU data frame)
  • 33. 33 Metrics and Monitoring with Prometheus and Grafana
  • 34. 34 Current Metrics & Monitoring Solutions: Telemetry, TOC, Nodewatch (custom in-house tools), Prometheus + Grafana
  • 35. 35 What is Prometheus?  Prometheus is a popular open-source time series database, monitoring and alerting system from the Cloud Native Computing Foundation (CNCF).  Very rich ecosystem of add-ons, exporters and SDKs.  Projects adopting “Cloud Native” approach have built-in support to expose metrics to Prometheus.  Uses a “Pull” model by default but a “Push” gateway is available when required.  Very high performance (v2.0+); a single powerful server can ingest up to 800,000 samples per second.  Multi-dimensional data model with time series data identified by metric name and key/value pairs.  Flexible query language PromQL
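Under the pull model, an exporter just serves plain text in the Prometheus exposition format over HTTP. A minimal sketch of rendering one such metric line (the metric name and labels are made up here; a real exporter would use a client library such as prometheus_client rather than formatting strings by hand):

```python
# Sketch of the Prometheus text exposition format: a TYPE comment line,
# then "metric_name{label="value",...} value". Prometheus scrapes this
# over HTTP and stores each line as a sample in its time series database.

def render_metric(name, labels, value):
    """Render a single gauge sample in the text exposition format."""
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return (f"# TYPE {name} gauge\n"
            f"{name}{{{pairs}}} {value}")

if __name__ == "__main__":
    # Hypothetical metric in the spirit of the SAS metering agent below.
    print(render_metric("sas_desktop_running",
                        {"host": "ws01", "owner": "mydept"}, 1))
```

The same multi-dimensional data model (metric name plus key/value labels) is what PromQL queries select on.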
  • 36. 36 Prometheus solution architecture (diagram): Prometheus pulls from node_exporter and wmi_exporter agents on workstations; probes external checks via the BlackBox exporter (HTTP, HTTPS, DNS, TCP, and ICMP); pulls network and storage device metrics via the SNMP exporter; and pulls job results and custom metrics that scripts post to the PushGateway. Grafana dashboards query Prometheus via PromQL, and the Alert Manager sends alerts.
  • 37. 37 Self-Service Metrics Gathering and Dashboards (diagram): install exporters on systems; commit a new or updated config file (*.yml) to GitHub, e.g. targets srv1.fhcrc.org:9100, srv2.fhcrc.org:9100, srv3.fhcrc.org:9182 with labels app: abc, owner: mydept; Prometheus incorporates the new targets and collects metrics from the systems; Grafana pulls metrics from Prometheus; authorized users (Grafana_Editors group, authenticated and authorized via Active Directory) log in and create dashboards, panels or alerts; view-only users just view.
  • 38. 38 SAS Metering Agent https://github.com/FredHutch/sas-metering-client  Single binary with no dependencies  Runs as a Windows service  Every minute checks whether SAS Desktop is running  POSTs the result (0 or 1) to the Prometheus push gateway  The push gateway is open to the Internet so workstations can report in from anywhere  Deployed to 135 SAS workstations via SCCM  Light resource footprint
  • 40. 40 Mirroring 1.4PB of data to S3 with Rsync and ObjectiveFS Initial mirror took 60 days, but was throttled to prevent overwhelming our Firewall. Started Mirroring First Pass Mirroring Complete S3 Standard to IA Migration
  • 41. 41 On-Prem vs Cloud HPC Metrics Core and Node Utilization
  • 42. 42 HPC Service SLA We don’t always meet our CPU core SLA
  • 43. 43 HPC – CPU Cores Per Lab Gizmo + Beagle Clusters Combined
  • 44. 44 EC2 Instance Types Using 32 Different Instance Types! Transitioning from C4 to C5 Instances
  • 45. 45 Keeping an eye on shared (abused) interactive nodes: System out of RAM? Who’s hogging the CPU? Who’s hogging the RAM? Interactive nodes:  3 systems  56 CPU cores  376GB RAM
  • 46. 46 Automated Pipelines Start Stop F(x) F(y)  Scientific Workflows  Application Deployment
  • 47. 47 Workflow  User prepares data and job configuration locally  Uses the command-line “mqsubmit” tool to submit jobs to the pipeline  When complete, the user receives an email with a link to download the results Notes  Proteomics mostly uses Windows platforms  Not easy to integrate in a Linux shop, but easier through a cloud API  Custom pipeline runs Windows jobs from a familiar command-line interface on Linux MaxQuant Proteomics Pipeline $ mqsubmit --mqconfig mqpar.xml --jobname job01 --department scicomp --email me@fhcrc.org Example Job Submission: https://github.com/FredHutch/maxquant-pipeline
  • 48. 48 MaxQuant Proteomics Pipeline “I am again extremely impressed at the speed of your cloud setup! Less than 12 hours to complete the search, and it would have easily been more than 4 days with our old setup!” -- MaxQuant Pipeline User 91% Utilization of 128 CPU cores
  • 49. 49 Shiny Web Application Pipelines (diagram: a user commits to Git and pushes to GitHub, which triggers a CircleCI build/test; the resulting image is pushed to Docker Hub and pulled by Rancher for container orchestration, serving the Shiny app to users) Good  After initial setup, researchers can update and re-deploy applications themselves  They only need to contact us if the pipeline breaks Bad  Building R packages and dependencies can take a long time to compile, as long as 40 minutes in some cases  Solution: use a base image with all or most of the dependencies cooked in
  • 50. 50 Statistics  420 repositories (268 private, 152 public)  148 users  81 teams https://github.com/FredHutch Git & GitHub skills are essential for developers, IT professionals and even researchers now.
  • 52. 52
  • 53. 53 Archiving?  Is there demand?  Are users collecting useful metadata?  How many PB do you need to be concerned?
  • 54. Ingredients (also used for storage charge-backs)  pwalk - https://github.com/fizwit/filesystem-reporting-tools a multi-threaded file system metadata collector  slurm-pwalk - https://github.com/FredHutch/slurm-pwalk parallel pwalk crawler execution and import into PostgreSQL  storage-hotspots - https://github.com/FredHutch/storage-hotspots a Flask app that helps find large folders in the database and triggers an archiving workflow  DB4SCI - http://db4sci.org (optional) High performance database platform OR better, just use
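The storage-hotspots idea boils down to aggregating per-folder byte counts from pwalk's file-level records. A simplified sketch assuming (path, size) tuples (real pwalk output carries many more metadata columns, and the real app queries them from PostgreSQL):

```python
# Sketch of a "storage hotspots" aggregation: roll file-level (path, size)
# records up to a fixed folder depth and return the largest folders first.
from collections import defaultdict

def hotspots(records, depth=2, top=3):
    """Return the `top` folders (truncated to `depth`) by total bytes."""
    totals = defaultdict(int)
    for path, size in records:
        parts = path.strip("/").split("/")[:depth]
        totals["/" + "/".join(parts)] += size
    return sorted(totals.items(), key=lambda kv: -kv[1])[:top]

if __name__ == "__main__":
    rows = [("/fast/lab_a/run1/x.fastq", 900),
            ("/fast/lab_a/run2/y.fastq", 700),
            ("/fast/lab_b/z.bam", 500)]
    print(hotspots(rows))   # lab_a folders dominate
```

In production the same roll-up runs as SQL over the pwalk tables; the Flask app then presents the largest folders and offers an archiving action.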
  • 55. 55 Data Storage News – Backup / DR in cloud  ObjectiveFS backup and DR moved into production in Oct 2017, moving 1PB/month  POSIX FS & Linux scale-out NAS on S3 (think cloud Gluster)  Parallel rsync: ls -a1 | tail -n +3 | parallel -u --progress rsync -avzP {} /target Cautions  Staff needs familiarity with rsync and monitoring / logging  Avoid more than 500TB per folder  Only 90% ends up in S3 IA, 10% stays in S3 Standard  Limit your retention period for backups Opportunities  One solution for backup / DR  $8 per TB / month at 50% compression  300+ MB/s per node  Re-using the backup / DR copy for fast read-only data access. + =
  • 56. 56 Data Storage Landscape at Fred Hutch today
  • 57. 57 Systems are presented uniformly via DFS/AutoFS (SMB/CIFS namespace, NFS/POSIX namespace) Does not work well with POSIX symlinks – Windows wonders: where is /folder?
  • 58. 58 We need a samba server to make all symlinks work
  • 59. 59 Well, if we don’t even fully use the NAS … can we go all the way and use OSS cheaply
  • 60. 60 A word about BeeGFS  In production as scratch FS since 2014  100% uptime (with some cheating)  Currently ca. 400TB capacity  1000 small files/sec (NetApp: 800, Isilon X: 280, Avere: 300)  Infinitely scalable through distributed metadata infrastructure  Open source, HA and change logs Risks  Config defaults to XFS instead of ZFS; ZFS is less widely tested  No vendor phone-home system Links  Configuration published https://github.com/FredHutch/chromium-zfs-beegfs  scratch-dna benchmark in Python, C and Go: https://github.com/FredHutch/sc-benchmark  A Samba server joined to AD, in production: https://github.com/FredHutch/sc-howto/
  • 61. 61 StorOne  Local SSD/disk used by KVM VMs  High performance  Made by storage industry veterans NyRiad  Uses GPUs to create a large erasure-coded pool  Linux kernel module creates one large block device  Built for the largest radio telescope Fun question: Is RAID still possible with 14TB drives?  And what if you think the answer is no?
  • 62. 62 Thank you ! Dirk Petersen petersen at fredhutch.org Robert McDermott rmcdermo at fredhutch.org