The PRP is a partnership of more than 50 institutions, led by researchers at UC San Diego and UC Berkeley, and includes the National Science Foundation, the Department of Energy, and multiple research universities in the US and around the world. The PRP builds on the optical backbone of Pacific Wave, a joint project of CENIC and the Pacific Northwest GigaPOP (PNWGP), to create a seamless research platform that encourages collaboration on a broad range of data-intensive fields and projects.
2. ● Stanford/SLAC Cryo-EM
○ Joint initiative between Stanford School of Medicine and SLAC Lab.
○ Led by Wah Chiu (formerly of BCM)
○ Data taking started ~Jan 2018
○ 3 Krios (2 with GIFs) and 1 Arctica
■ All with Gatan K2 cameras; upgrades to K3 start next month!
■ Users: NIH U24, NIH P41, Stanford and SLAC
● S2C2: Stanford-SLAC Cryo-EM Center
○ New NIH U24 Award
○ Will add 4+ microscopes within the next 3 years; 2 already on order
● Me:
○ Plan, manage and operate all data management, computation, and software infrastructure for all Cryo-EM facilities
○ Day job: work with all SLAC science to support their computational and storage requirements (LSST, Fermi, LCLS I/II, CDMS, ATLAS)
Introduction
Stanford/SLAC will provide world-class expertise and training for Cryo-EM
3. Architectural Overview
Similar architecture to LCLS and other data-centric scientific experiments
[Diagram: Detector (TEM, ×4) → up to 0.267 GB/s → onsite SLAC (petascale) fast-feedback storage (~2 Gb/s), feeding the data reduction pipeline (currently N/A) and online monitoring; fast feedback turns around in ~seconds to ~2 min. From there data lands on onsite offline storage with petascale HPC for pre-processing and reconstruction (2D + 3D + refinement), turnaround ~days. Offsite, user institutions (universities, other labs, etc.) hold offline storage with terascale HPC. Rate arithmetic is sketched below.]
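A quick back-of-envelope check of the rates in the diagram (a sketch only; the per-microscope figure comes from the editor's notes at the end). Note that ~0.267 shows up twice for different reasons: per scope it is Gb/s (×8 bits per byte), while the aggregate is GB/s (×8 microscopes).

    # Back-of-envelope check of the diagram's data rates. Per the
    # editor's notes, each microscope writes ~2 GB of movies per minute.
    per_scope_GB_s = 2 / 60               # ~0.033 GB/s per microscope
    per_scope_Gb_s = per_scope_GB_s * 8   # ~0.267 Gb/s (8 bits per byte)
    aggregate_GB_s = per_scope_GB_s * 8   # ~0.267 GB/s across 8 microscopes
    print(f"{per_scope_GB_s:.3f} GB/s per scope, "
          f"{per_scope_Gb_s:.3f} Gb/s per scope, "
          f"{aggregate_GB_s:.3f} GB/s aggregate")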
4. ● On TEM2 alone
○ 35 active proposals
○ >100 experiments
● Mostly limited by
○ Scope downtime
○ Screening time
○ Managerial efficiencies
● Typical Experiment (rough volume estimate below)
○ Single Particle: ~6,000 images, total of 4-8TB
○ Tomography: 10-20 tomograms, total of 1-2TB
Activity Thus Far...
Ever-increasing data rates
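The experiment sizes above imply the following rough volumes. This is a sketch: the 100 experiments/year figure is an assumption taken from the TEM2 activity above, not a measured throughput.

    # Rough data-volume estimate from the "typical experiment" numbers.
    images_per_exp = 6_000
    tb_per_exp = (4 + 8) / 2                            # midpoint of 4-8 TB
    gb_per_movie = tb_per_exp * 1000 / images_per_exp   # ~1 GB per movie stack
    experiments_per_year = 100                          # assumption, see above
    pb_per_year = tb_per_exp * experiments_per_year / 1000
    print(f"~{gb_per_movie:.1f} GB/movie, ~{pb_per_year:.1f} PB/year")

At ~0.6 PB/year this is consistent with the petabyte-scale disk figures on the predictions slide.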
5. Technological Tenets of Data Management
Provide users quick feedback on sample and data quality
● Data Pipelines & Data Provenance
● eLogBook & Monitoring/Reporting
● Near Real Time Feedback & Data Analysis
6. User Focused
“Is my sample viable?”, “Can I get answers from my time on the ‘scope?”
● Provide rough gauge of sample, image and data quality in (near) real-time
● Provide automated processing and feedback on
○ sample previews (remote access)
○ Initial pre-processing (a submission sketch follows this list):
■ Movie alignment (e.g. MotionCor)
■ CTF estimation (e.g. ctffind/gctf)
■ Initial automated particle picking (e.g. dogpicker/gautomatch)
○ Soon:
■ Initial 2D class averages
■ Initial 3D density map
● Provide data management, computational resources and software support for users
○ GPU resources for RELION, cryoSPARC, etc.
○ Storage of metadata (logbook), raw data and initial data products
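A minimal sketch of the fast-feedback submission loop described above, assuming movies land in a per-run directory and jobs go to an LSF GPU queue via bsub. The run path, queue name, and MotionCor2 options are illustrative placeholders, not the production configuration.

    # Watch a run directory and submit each new movie for motion
    # correction; paths, queue and flags are illustrative only.
    import subprocess
    import time
    from pathlib import Path

    RUN_DIR = Path("/data/tem2/run0123/movies")   # hypothetical run directory
    seen = set()

    def submit_preprocess(movie):
        """Submit one movie for motion correction on an LSF GPU queue."""
        out = movie.with_suffix(".aligned.mrc")
        cmd = f"MotionCor2 -InMrc {movie} -OutMrc {out} -Patch 5 5 -Gpu 0"
        subprocess.run(["bsub", "-q", "cryoem-gpu", cmd], check=True)

    while True:
        for movie in sorted(RUN_DIR.glob("*.mrc")):
            if movie not in seen:
                seen.add(movie)
                submit_preprocess(movie)
        time.sleep(5)   # seconds-scale loop, matching the fast-feedback tier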
7. ● Open-source framework for ETL
● Rich Python-based pluggable architecture
● Integrated into our LSF, TSDB and GPFS environment
● Full accounting and reporting of pipeline runs
● Horizontally scalable; deployed as containers
● Easy-to-use GUI; flexible CLI
● Each pipeline is defined as a graph (DAG); a minimal example follows this slide
Pipelines: Managed with Apache Airflow
Don’t reinvent the wheel; add mud-flaps
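A minimal Airflow DAG sketch of the pre-processing pipeline, using the Airflow 1.x-era import path. The wrapper scripts (align_movie.sh, estimate_ctf.sh, pick_particles.sh) are hypothetical stand-ins, not the production S2C2 pipeline.

    # Pre-processing pipeline expressed as an Airflow DAG; triggered per
    # movie rather than on a schedule. Wrapper scripts are hypothetical.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="cryoem_preprocess",
        start_date=datetime(2018, 1, 1),
        schedule_interval=None,   # triggered externally, once per movie
    )

    align = BashOperator(
        task_id="motion_correction",
        bash_command="align_movie.sh {{ dag_run.conf['movie'] }}",
        dag=dag,
    )
    ctf = BashOperator(
        task_id="ctf_estimation",
        bash_command="estimate_ctf.sh {{ dag_run.conf['movie'] }}",
        dag=dag,
    )
    pick = BashOperator(
        task_id="particle_picking",
        bash_command="pick_particles.sh {{ dag_run.conf['movie'] }}",
        dag=dag,
    )

    # The pipeline really is a DAG: CTF estimation and particle picking
    # both depend on alignment and can run in parallel.
    align >> ctf
    align >> pick

Triggering per movie (via dag_run.conf) keeps the DAG definition static while a watcher like the one above supplies each new file.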
8. ● Live pre-processing previews sent to a dedicated Slack channel (webhook sketch after this slide)
● eLogBook
○ Provides a centralised platform for users to access, view, annotate and process their data
○ Provides management reports etc.
eLogBook & Monitoring/Reporting
Reporting for both Users and Management
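A minimal sketch of the Slack preview push, assuming an incoming-webhook integration; the webhook URL and message fields are placeholders.

    # Post a one-line quality summary plus a preview link to Slack via
    # an incoming webhook. URL and fields are placeholders.
    import requests

    WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

    def post_preview(movie, defocus_um, preview_url):
        text = f"*{movie}*: defocus {defocus_um:.2f} um\npreview: {preview_url}"
        resp = requests.post(WEBHOOK_URL, json={"text": text}, timeout=10)
        resp.raise_for_status()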
9. Infrastructure/Resource Predictions
Significant ramp-up of (GPU) compute and storage
                 Currently    0-5 years *   10 years *
CPU Compute      0.0 PFLOPS   0.0 PFLOPS    0.0 PFLOPS
GPU              1.0 PFLOPS   10.0 PFLOPS   50.0 PFLOPS
Disk Storage     1 PB         5 PB          10 PB
Tape Storage     0            5 PB $        10 PB $
Racks            1            6             10
Numbers assume single-particle analysis only. Tomography requirements TBD (assumed similar)
* Preliminary
$ Heavily dependent upon agreed NIH data retention policies
10. ● Funding agencies want to know how much time/effort/storage/compute a result takes, as well as how many papers can be published
● Experimentalists do not want to become experts in hardware and software
● Horizontally scalable hardware and software solutions required
○ No single point of failure (SPOF); burstability; off-site/cloud
● Hot/Cold/Warm data
○ How big is the active data set? How long do they need the data around for? Do duplicates exist?
○ Use of HSM and/or cloud storage
● Data/Experiment portability
○ Not just rsync of data
○ Provenance of workflow and data: reproducibility of results - containers!
● Experimental metadata
○ microscope parameters + sample prep details
○ Metadata and data catalogue
● Down in the weeds:
○ Inodes! Best to move to object stores… software support? (see the sketch below)
Remarks
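On the inode remark: a minimal sketch of packing a run's thousands of small files into a single S3-style object, which trades inode pressure for object-store metadata. The bucket, endpoint, and choice of boto3 are assumptions for illustration, not necessarily what S2C2 runs.

    # Tar a run directory (many small files, one inode each) and upload
    # it as a single object; endpoint and bucket are placeholders.
    import tarfile
    from pathlib import Path

    import boto3

    s3 = boto3.client("s3", endpoint_url="https://s3.example.org")

    def archive_run(run_dir, bucket="cryoem-raw"):
        run_dir = Path(run_dir)
        tarball = run_dir.with_suffix(".tar")
        with tarfile.open(tarball, "w") as tar:
            tar.add(run_dir, arcname=run_dir.name)
        key = f"runs/{tarball.name}"
        s3.upload_file(str(tarball), bucket, key)   # one object, not 6,000 inodes
        return key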
Editor’s notes
2 GB per minute PER MICROSCOPE
i.e. ~0.033 GB/s or ~0.267 Gb/s per microscope
What is shown is the 8 microscopes (4 existing + 4 on the new U24 grant); 8 × 0.033 GB/s gives the ~0.267 GB/s aggregate in the diagram
Data Reduction Pipeline NOT APPLICABLE currently; there is potential to use ML/AI to veto ‘bad’ images, but that would likely happen at the Fast Feedback layer.
Fast Feedback and Offline storage are currently the same tier; this will be implemented as a cache layer rather than a physically separate layer.
Give experimenters a sense of how the experiment is going.
Data retention is assumed short (maybe only 4 months); most data will be Wah Chiu lab group data only.
Data derived from https://docs.google.com/spreadsheets/d/1QCM9x7Q6u_haqeIl4vUc9g2AvP80Kr0yVEnjm1gibfE/edit#gid=0