Stansted slides-desy


  1. PETRAIII/EuXFEL data archiving. Sergey Yakubov, Martin Gasthuber / DESY-IT. London, 23 May 2019
  2. PETRAIII/EuXFEL data archiving | Martin Gasthuber / Sergey Yakubov, May 2019 (National)
  3. DESY Campus Hamburg – many more communities
     • PETRA III: synchrotron radiation source (highest brilliance)
     • FLASH: VUV & soft-x-ray free-electron laser
     • X-Ray Free-Electron Laser: atomic structure & fs dynamics of complex matter
     • MPI-SD, CHyN, HARBOR, CXNS, NanoLab, CWS
  4. sources of data
     • 3 active accelerators on-site (all photon science): PETRA III, FLASH and EuXFEL
     • currently 30 active experimental areas (called beamlines), operated in parallel
     • more in preparation
     • PETRA IV (future): expect 10^4 to 10^5 times more data
     • majority of generated data is analyzed within a few months (cooling)
     • have two independent copies asap (raw & calibration data)
  5. DESY datacenter - resources interacting with ARCHIVER
     data processing resources before archiving
     • HPC cluster – 400 nodes, 30,000 cores, large InfiniBand fabric
     • GPFS – 30 building blocks, >30 PB, all InfiniBand connected
     • BeeGFS – 3 PB, InfiniBand connected
     • LHC computing – Analysis Facility + Tier-2, 1000 nodes, 30,000 cores
     • 50-60% more resources outside the datacenter (mostly at experimental stations)
     current archiving capabilities
     • dCache – 5 large instances, >50 PB capacity, >120 building blocks, tape gateway
     • Tape – 2 x SL8500 (15,000 slots), 25 x LTO8, 8 x LTO6, >80 PB capacity
  6. data life cycle as of today - from the cradle to the grave
     • new archive service connected to ‘Core-FS’ and/or after dCache to fit seamlessly into the existing workflow
     • this scenario will most likely use the fully automated (API/CLI) archive system interface
  7. site manager & administrative workflows
     integration, setup and control - workflow derived requirements
     • fully networked service allowing vertical and horizontal scaling (obvious)
     • wide range of authentication methods usable (besides local site ones) – x509, OpenID, eduGAIN, … - more is better
       • used to ‘authenticate’ and to be usable in ‘ACL’-like authorization settings (the identity or DN)
     • role based service selection (archive profiles - user -> set of roles -> set of archive profiles)
     • delegation model for administration - site admin + group admins (with site admin defined limits/pre-selections)
     • site data policy and community contract dependent ‘archive profiles’ defining major parameters and limits, e.g. QoS definitions
     • wide-area access
       • http* based - allows platform independent tools and standard firewall configs (e.g. WebDAV, …)
       • mobile devices (tablet, phone, …) (tools + protocols) not excluded
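The role based selection on this slide (user -> set of roles -> set of archive profiles) could be resolved roughly as in the following minimal sketch. All user names, role names and profile names are hypothetical placeholders, and the two lookup tables stand in for whatever the site and group admins would actually configure; nothing here is taken from an existing archive service.

```python
# Hypothetical tables a site admin might maintain (illustrative values only).
USER_ROLES = {
    "alice": {"scientist"},
    "bob": {"scientist", "beamline-manager"},
}

ROLE_PROFILES = {
    "scientist": {"personal-10y"},
    "beamline-manager": {"beamline-10y", "personal-10y"},
}


def profiles_for(user: str) -> list[str]:
    """Return the archive profiles a user may select, as the union over all roles."""
    allowed: set[str] = set()
    for role in USER_ROLES.get(user, set()):
        allowed |= ROLE_PROFILES.get(role, set())
    return sorted(allowed)


if __name__ == "__main__":
    for name in ("alice", "bob", "unknown"):
        print(name, profiles_for(name))
```

An unknown user resolves to an empty list, which keeps the default restrictive; the delegation model would then amount to who is allowed to edit which table.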
  8. end user workflows - I
     individual scientist – managing private scientific data (self-generated and self-managed)
     • individual scientist archiving important work (e.g. publication, partial analysis results, …) – DOI required
     • key metrics
       • Single archive size: average 10-100 GB
       • Files in archive: average 10,000
       • Total archive size per user: 5 TB
       • Duration: 5-10 years
       • Ingest rates: 10-100 MB/s (more is better)
       • encryption: not required, nice to have
     • browser based interaction (authentication, data transfers, metadata query/ingest)
     • CLI tools usable for data ingest
     • metadata query
       • starting from a single string input (like Google search) - interactive/immediate selection response
       • other methods (e.g. referencing/finding through experiment managing services) used in addition
  9. end user workflows - II
     beamline manager – mix of automated and experiment specific/manual archive interaction
     • beamline (experimental station) specific + experiment specific, medium size and rate
     • key size parameters
       • Single archive size: average 5 TB
       • Files in archive: average 150,000
       • Total archive size per beamline: 400 TB, doubles every year
       • Duration: 10 years
       • Ingest rates: 1-2 GB/s
       • encryption: not required
     • 3rd party copy - ‘gather’ all data from various primary storage systems - controlled from a single point
     • local (to site) data transport should be RDMA based and operate (efficiently) on networks faster than 10 Gb/s
     • data encryption in transit not required
     • API + CLI for seamless automation - e.g. API manifested as REST API
     • CLI on Linux, API should support all platforms (incl. Windows ;-)
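To get a feeling for what "400 TB, doubles every year" means over the 10 year duration, the growth can be projected with a few lines of arithmetic. This is only a sketch: it assumes the total per-beamline archive size doubles each year starting from 400 TB, which is one possible reading of the slide, and the function name is made up for illustration.

```python
def projected_size_tb(initial_tb: float, years: float, doubling_period_years: float = 1.0) -> float:
    """Exponential growth: the size doubles once per doubling period."""
    return initial_tb * 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    for year in (1, 5, 10):
        size_tb = projected_size_tb(400, year)
        print(f"after {year:2d} years: {size_tb:>10,.0f} TB ({size_tb / 1000:,.1f} PB)")
```

Under that reading a single beamline would reach about 410 PB after ten doublings, so the doubling assumption dominates any sizing far more than the starting value does.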
  10. end user workflows - III
     Integrated data archiving for large standardized beamline/facility experiments
     • large collaboration or site managing and controlling archive operations on behalf of (all experiments) - all automated and large scale
     • all inherited from previous workflow - except the manual part - all interaction automated
     • key size parameters
       • Single archive size: average 400 TB
       • Files in archive: average 25,000
       • Total archive size per beamline: tens of PB, doubles every year
       • Duration: 10 years
       • Ingest rates: 10-100 GB/s - for a period of 20-50 min
       • encryption: not required
     • bulk recall - planned re-analysis requires bulk restore operation with decent rates (50% of ingest rate) (feed the compute engine)
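The ingest and recall figures on this slide translate into burst volumes and restore times as sketched below. The sketch uses decimal units (1 TB = 1000 GB) and assumes the recall rate is exactly 50% of the ingest rate; the function names are illustrative only.

```python
def burst_volume_tb(rate_gb_s: float, minutes: float) -> float:
    """Data volume written during one ingest burst, in TB."""
    return rate_gb_s * minutes * 60 / 1000


def recall_hours(archive_tb: float, ingest_rate_gb_s: float, recall_fraction: float = 0.5) -> float:
    """Time to restore one archive when recall runs at a fraction of the ingest rate."""
    recall_rate_gb_s = ingest_rate_gb_s * recall_fraction
    return archive_tb * 1000 / recall_rate_gb_s / 3600


if __name__ == "__main__":
    print("smallest burst:", burst_volume_tb(10, 20), "TB")   # 10 GB/s for 20 min
    print("largest burst: ", burst_volume_tb(100, 50), "TB")  # 100 GB/s for 50 min
    print("400 TB recall at 5 GB/s: ", round(recall_hours(400, 10), 1), "h")
    print("400 TB recall at 50 GB/s:", round(recall_hours(400, 100), 1), "h")
```

With these assumptions a single burst is between 12 TB and 300 TB, and restoring an average 400 TB archive takes roughly 2 to 22 hours depending on which end of the ingest range applies.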
  11. left over…
     other thoughts, requirements and options
     • life cycle of archive objects (not bound to a single access session) - create, fill (meta)data, close - data becomes immutable, query
     • archive objects could be related to existing ones - e.g. containing new versions of derived data
     • archive service should generate and handle DOIs (zenodo) for durable external references
     • all data access should be ‘stream’ based
       • no random access (within a file) is required
       • recalls of pre-selected files out of a single archive object
     • asynchronous notifications on (selectable) conditions (events). Support interaction (external state) with DBs external to the archive system
       • e.g. archive object is saved, verified (as condition)
     • deployment scenarios
       • main services and esp. metadata store/query
         • local on site
         • cloud (using remote service and storage/handling hardware)
       • bit stream preservation layer
         • local only
         • tiered - local and remote (e.g. remote tape) - remote could be ‘cooperating lab’, public cloud, …
         • (streaming) protocol to transfer data between tiers should adhere to ‘wide area’ thoughts (standards based)
     • Billing
       • any ‘non-local’ deployment requires billing services and methods (obvious), separated into service and storage costs (at least)
       • external storage resource - long term predictable costs/contracts preferred (less ‘pay as you go’)
       • per user and group billing (user may be member of several groups and groups might be nested)
     • encryption - in all cases ‘nice to have’, expecting issues with local ‘key management’ services
       • pre and post en/decryption of data in motion and/or at rest is a valid alternative
     • (Meta)data formats
       • no special (known to the archive service) data formats required, thus no format conversions (without user interaction) required
       • Metadata needs to be ‘exportable’ to new/updated instances