The PRP is a partnership of more than 50 institutions, led by researchers at UC San Diego and UC Berkeley, and includes the National Science Foundation, the Department of Energy, and multiple research universities in the US and around the world. The PRP builds on the optical backbone of Pacific Wave, a joint project of CENIC and the Pacific Northwest GigaPOP (PNWGP), to create a seamless research platform that encourages collaboration on a broad range of data-intensive fields and projects.
2. ● Stanford/SLAC Cryo-EM
○ Joint initiative between Stanford School of Medicine and SLAC Lab.
○ Led by Wah Chiu (formerly of BCM)
○ Data taking started ~Jan 2018
○ 3 Krios (2 with GIFs) and 1 Arctica
■ All with Gatan K2 cameras; upgrades to K3 start next month!
■ Users: NIH U24, NIH P41, Stanford and SLAC
● S2C2: Stanford-SLAC Cryo-EM Center
○ New NIH U24 Award
○ Will add 4+ microscopes within the next 3 years; 2 already on order
● Me:
○ Plan, manage and operate all data management, computation, and software infrastructure for all Cryo-EM facilities
○ Day job: work with all SLAC science to support their computational and storage requirements (LSST, Fermi, LCLS I/II, CDMS, ATLAS)
Introduction
Stanford/SLAC will provide world-class expertise and training for Cryo-EM
3. Architectural Overview
Similar architecture to LCLS and other data-centric scientific experiments
[Diagram: Detector (TEM, ×4) → up to 0.267 GB/s → onsite SLAC (petascale) fast-feedback storage (~2 Gb/s), feeding the data reduction pipeline (currently N/A) and online monitoring; fast feedback turns around in ~seconds to ~2 min. From there data lands on onsite offline storage with petascale HPC for pre-processing and reconstruction (2D + 3D + refinement), turnaround ~days. Offsite, user institutions (universities, other labs, etc.) hold offline storage with terascale HPC. Rate arithmetic is sketched below.]
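A quick back-of-envelope check of the rates in the diagram (a sketch only; the per-microscope figure comes from the editor's notes at the end). Note that ~0.267 shows up twice for different reasons: per scope it is Gb/s (×8 bits per byte), while the aggregate is GB/s (×8 microscopes).

    # Back-of-envelope check of the diagram's data rates. Per the
    # editor's notes, each microscope writes ~2 GB of movies per minute.
    per_scope_GB_s = 2 / 60               # ~0.033 GB/s per microscope
    per_scope_Gb_s = per_scope_GB_s * 8   # ~0.267 Gb/s (8 bits per byte)
    aggregate_GB_s = per_scope_GB_s * 8   # ~0.267 GB/s across 8 microscopes
    print(f"{per_scope_GB_s:.3f} GB/s per scope, "
          f"{per_scope_Gb_s:.3f} Gb/s per scope, "
          f"{aggregate_GB_s:.3f} GB/s aggregate")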
4. ● On TEM2 alone
○ 35 active proposals
○ >100 experiments
● Mostly limited by
○ Scope downtime
○ Screening time
○ Managerial efficiencies
● Typical Experiment (rough volume estimate below)
○ Single Particle: ~6,000 images, total of 4-8TB
○ Tomography: 10-20 tomograms, total of 1-2TB
Activity Thus Far...
Ever-increasing data rates
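The experiment sizes above imply the following rough volumes. This is a sketch: the 100 experiments/year figure is an assumption taken from the TEM2 activity above, not a measured throughput.

    # Rough data-volume estimate from the "typical experiment" numbers.
    images_per_exp = 6_000
    tb_per_exp = (4 + 8) / 2                            # midpoint of 4-8 TB
    gb_per_movie = tb_per_exp * 1000 / images_per_exp   # ~1 GB per movie stack
    experiments_per_year = 100                          # assumption, see above
    pb_per_year = tb_per_exp * experiments_per_year / 1000
    print(f"~{gb_per_movie:.1f} GB/movie, ~{pb_per_year:.1f} PB/year")

At ~0.6 PB/year this is consistent with the petabyte-scale disk figures on the predictions slide.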
5. Technological Tenets of Data Management
Provide users quick feedback on sample and data quality
● Data Pipelines & Data Provenance
● eLogBook & Monitoring/Reporting
● Near Real Time Feedback & Data Analysis
6. User Focused
“Is my sample viable?”, “Can I get answers from my time on the ‘scope?”
● Provide rough gauge of sample, image and data quality in (near) real-time
● Provide automated processing and feedback on
○ sample previews (remote access)
○ Initial pre-processing (a submission sketch follows this list):
■ Movie alignment (e.g. MotionCor)
■ CTF estimation (e.g. ctffind/gctf)
■ Initial automated particle picking (e.g. dogpicker/gautomatch)
○ Soon:
■ Initial 2D class averages
■ Initial 3D density map
● Provide data management, computational resources and software support for users
○ GPU resources for RELION, cryoSPARC, etc.
○ Storage of metadata (logbook), raw data and initial data products
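A minimal sketch of the fast-feedback submission loop described above, assuming movies land in a per-run directory and jobs go to an LSF GPU queue via bsub. The run path, queue name, and MotionCor2 options are illustrative placeholders, not the production configuration.

    # Watch a run directory and submit each new movie for motion
    # correction; paths, queue and flags are illustrative only.
    import subprocess
    import time
    from pathlib import Path

    RUN_DIR = Path("/data/tem2/run0123/movies")   # hypothetical run directory
    seen = set()

    def submit_preprocess(movie):
        """Submit one movie for motion correction on an LSF GPU queue."""
        out = movie.with_suffix(".aligned.mrc")
        cmd = f"MotionCor2 -InMrc {movie} -OutMrc {out} -Patch 5 5 -Gpu 0"
        subprocess.run(["bsub", "-q", "cryoem-gpu", cmd], check=True)

    while True:
        for movie in sorted(RUN_DIR.glob("*.mrc")):
            if movie not in seen:
                seen.add(movie)
                submit_preprocess(movie)
        time.sleep(5)   # seconds-scale loop, matching the fast-feedback tier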
7. ● Open-source framework for ETL
● Rich Python-based pluggable architecture
● Integrated into our LSF, TSDB and GPFS environment
● Full accounting and reporting of pipeline runs
● Horizontally scalable; deployed as containers
● Easy-to-use GUI; flexible CLI
● Each pipeline is defined as a graph (DAG); a minimal example follows this slide
Pipelines: Managed with Apache Airflow
Don’t reinvent the wheel; add mud-flaps
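A minimal Airflow DAG sketch of the pre-processing pipeline, using the Airflow 1.x-era import path. The wrapper scripts (align_movie.sh, estimate_ctf.sh, pick_particles.sh) are hypothetical stand-ins, not the production S2C2 pipeline.

    # Pre-processing pipeline expressed as an Airflow DAG; triggered per
    # movie rather than on a schedule. Wrapper scripts are hypothetical.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="cryoem_preprocess",
        start_date=datetime(2018, 1, 1),
        schedule_interval=None,   # triggered externally, once per movie
    )

    align = BashOperator(
        task_id="motion_correction",
        bash_command="align_movie.sh {{ dag_run.conf['movie'] }}",
        dag=dag,
    )
    ctf = BashOperator(
        task_id="ctf_estimation",
        bash_command="estimate_ctf.sh {{ dag_run.conf['movie'] }}",
        dag=dag,
    )
    pick = BashOperator(
        task_id="particle_picking",
        bash_command="pick_particles.sh {{ dag_run.conf['movie'] }}",
        dag=dag,
    )

    # The pipeline really is a DAG: CTF estimation and particle picking
    # both depend on alignment and can run in parallel.
    align >> ctf
    align >> pick

Triggering per movie (via dag_run.conf) keeps the DAG definition static while a watcher like the one above supplies each new file.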
8. ● Live pre-processing previews sent to a dedicated Slack channel (webhook sketch after this slide)
● eLogBook
○ Provides a centralised platform for users to access, view, annotate and process their data
○ Provides management reports etc.
eLogBook & Monitoring/Reporting
Reporting for both Users and Management
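A minimal sketch of the Slack preview push, assuming an incoming-webhook integration; the webhook URL and message fields are placeholders.

    # Post a one-line quality summary plus a preview link to Slack via
    # an incoming webhook. URL and fields are placeholders.
    import requests

    WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

    def post_preview(movie, defocus_um, preview_url):
        text = f"*{movie}*: defocus {defocus_um:.2f} um\npreview: {preview_url}"
        resp = requests.post(WEBHOOK_URL, json={"text": text}, timeout=10)
        resp.raise_for_status()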
9. Infrastructure/Resource Predictions
Significant ramp-up of (GPU) compute and storage
                 Currently    0-5 years *   10 years *
CPU Compute      0.0 PFLOPS   0.0 PFLOPS    0.0 PFLOPS
GPU              1.0 PFLOPS   10.0 PFLOPS   50.0 PFLOPS
Disk Storage     1 PB         5 PB          10 PB
Tape Storage     0            5 PB $        10 PB $
Racks            1            6             10
Numbers assume single-particle analysis only. Tomography requirements TBD (assumed similar)
* Preliminary
$ Heavily dependent upon agreed NIH data retention policies
10. ● Funding agencies want to know how much time/effort/storage/compute a result takes, as well as how many papers can be published
● Experimentalists do not want to become experts in hardware and software
● Horizontally scalable hardware and software solutions required
○ No single point of failure (SPOF); burstability; off-site/cloud
● Hot/Cold/Warm data
○ How big is the active data set? How long do they need the data around for? Do duplicates exist?
○ Use of HSM and/or cloud storage
● Data/Experiment portability
○ Not just rsync of data
○ Provenance of workflow and data: reproducibility of results - containers!
● Experimental metadata
○ microscope parameters + sample prep details
○ Metadata and data catalogue
● Down in the weeds:
○ Inodes! Best to move to object stores… software support? (see the sketch below)
Remarks
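On the inode remark: a minimal sketch of packing a run's thousands of small files into a single S3-style object, which trades inode pressure for object-store metadata. The bucket, endpoint, and choice of boto3 are assumptions for illustration, not necessarily what S2C2 runs.

    # Tar a run directory (many small files, one inode each) and upload
    # it as a single object; endpoint and bucket are placeholders.
    import tarfile
    from pathlib import Path

    import boto3

    s3 = boto3.client("s3", endpoint_url="https://s3.example.org")

    def archive_run(run_dir, bucket="cryoem-raw"):
        run_dir = Path(run_dir)
        tarball = run_dir.with_suffix(".tar")
        with tarfile.open(tarball, "w") as tar:
            tar.add(run_dir, arcname=run_dir.name)
        key = f"runs/{tarball.name}"
        s3.upload_file(str(tarball), bucket, key)   # one object, not 6,000 inodes
        return key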
Editor’s notes
2 GB per minute PER MICROSCOPE
i.e. ~0.033 GB/s or ~0.267 Gb/s per microscope
What is shown is the 8 microscopes (4 existing + 4 on the new U24 grant); 8 × 0.033 GB/s gives the ~0.267 GB/s aggregate in the diagram
Data Reduction Pipeline NOT APPLICABLE currently; there is potential to use ML/AI to veto ‘bad’ images, but that would likely happen at the Fast Feedback layer.
Fast Feedback and Offline storage are currently the same tier; this will be implemented as a cache layer rather than a physically separate layer.
Give experimenters a sense of how the experiment is going.
Data retention is assumed short (maybe only 4 months); most data will be Wah Chiu lab group data only.
Data derived from https://docs.google.com/spreadsheets/d/1QCM9x7Q6u_haqeIl4vUc9g2AvP80Kr0yVEnjm1gibfE/edit#gid=0