SlideShare una empresa de Scribd logo
1 de 24
Descargar para leer sin conexión
exascaleproject.org
ExaLearn Overview
ECP’s Co-design Center for Exascale Machine Learning
Technologies
Project PI: Francis J. Alexander, BNL
Partner PIs and Institutions:
▪ Ian Foster, ANL
▪ Peter Nugent, LBNL
▪ Brian Van Essen, LLNL
▪ Aric Hagberg, LANL
▪ David Womble, ORNL
▪ James A. Ang, PNNL
▪ Michael Wolf, SNL
Place: HPC User Forum, Santa Fe, NM
Date: April 2, 2019
2
Co-design is a Multidisciplinary Collaborative Process
• In Co-design, HPC systems are developed with application software influencing
hardware design trade-offs while also recognizing that applications and
supporting software must be developed concurrently with hardware changes.
• Under the status quo, HPC systems are developed without application input, and
“advanced architectures” have a reputation of being “hard to program.”
• Co-design encourages all parties, ranging from the system designers, computer
architects, application software and tools developers, facilities, etc., to jointly
design and optimize the system specifications via open communication and
collaboration.
3
ECP’s Portfolio of Co-design Projects
• CODAR: Co-design center for Online Data Analysis and Reduction
• COPA: Co-design Center for Particle Applications
• AMREX: Block-structured adaptive mesh refinement (AMR) Co-design Center
• CEED: Center for Efficient Exascale Discretizations (CEED)
• ExaGraph: Combinatorial Methods for Enabling Exascale Applications
• ExaLearn: Co-design Center for Exascale Machine Learning Technologies
4
Three Overarching ExaLearn Co-design Questions
• What are the best learning algorithms and methods for different application
classes in terms of speed, accuracy, and resource requirements? How can we
implement these algorithms to achieve scalability and performance portability?
• What are the trade-offs in learning accuracy, resource needs, and overall
application performance between using various learning methods? How do these
trade-offs vary with exascale hardware and software choices?
• How do we effectively orchestrate learning to reduce associated overheads?
How can exascale hardware and software help with this orchestration?
5
Relationships between Machine Learning and HPC
• HPC for Machine Learning: HPC Technologies are applied to learning tasks to
accelerate computation and/or solve larger problems.
• Machine Learning for HPC: Learning Technologies are applied to HPC
computations to improve their performance in some way, e.g., by choosing the
next simulation(s) to perform.
HPC for MLML for HPC
Ack: Gadi Singer
6
Application Priorities Determine Machine Learning Methods
• Deep Learning (CNN, RNN, etc.)
• Ensemble Methods and Random Forest Methods
• Reinforcement Learning
• Kernel Methods
• Tensor Methods
• Graph-Based learning
7
Overarching Goals for ExaLearn
• Provide exascale machine learning software for use by:
– ECP Applications Projects
– Other ECP Co-design Centers
– DOE Experimental Facilities
– DOE Leadership Class Computing Facilities
• ExaLearn will establish multidisciplinary collaborations in learning technologies
that cross-cut individual ECP projects:
– ECP AD projects that share an interest in a machine learning method
– ECP ST projects
– ECP HI/PathForward projects
8
Work Product is a Software Toolset that…
• Applies to multiple problems within the DOE mission
• Has a line-of-sight to exascale computing, e.g., by using exascale platforms
directly or providing essential components to an exascale workflow
• Does not replicate capabilities easily obtainable from existing, widely available
packages
• Builds in domain knowledge where possible (not often done by industry,
although efforts beginning IBM, GE, etc.); “physics”-based ML and AI (recurring
theme)
• Quantifies Uncertainty in predictive capacity
• Is interpretable
• Is reproducible.
9
Application Evaluation Criteria
• Would the work we propose directly benefit an ECP application/program, and/or an important non-ECP application, and/or
DOE user facility? Impact could include facility design and control, design of experiment, etc.
• Is it exascale in size?
(a) i.e., tuning hyperparameters on individual nodes over the whole machine (pseudo-embarrassingly parallel)
(b) i.e., deploying a complex network across the whole machine
(c) does any aspect of the ML problem (either data size or optimization time, etc., require exascale)
• Does it enable new science? What is the specific science question?
• What ML methods do they employ right now?
• Do they have examples in-hand on which we can benchmark at scale now?
• Are there cross-cut benefits to other science areas?
• What are the known performance considerations and/or tool barriers ?
• What data are available, what is the quality of the data, how is it curated, and what are the risks associated with using that
data?
• What would the workflow look like (training/ensemble/optimization/etc.)?
• Is there willingness/interest on the part of the application/facilities folks in working with ExaLearn ?
• Is there any existing software code base/how will feedback work/can we make a meaningful impact/will it lead to future
work like SciDAC?
10
ExaLearn for Surrogates–Making Realistic
Simulations on the Cheap
• Challenge and Importance: Many DOE simulation efforts could benefit from having realistic surrogate models
in place of computationally expensive simulations. These can be used to quickly flesh out parameter space,
help with real-time decision making, assist experimental design, and determine the best areas to perform
additional simulations. We will first target large-scale structure simulations of the universe. As the field is well
developed, the scale can easily be ramped up to an exascale ML challenge, and the field is robust enough to
explore systematics at the sub-percent level.
• ML impact: Neural-networks-based generative models can make reliable surrogate models of expensive
simulations for data augmentation purposes. Such surrogate models can be used to aid in cosmological
analysis to reduce systematic uncertainties in observations.
• Timeliness: The ExaSky Application projects is producing the largest LSS simulations now. The DESI
experiment starts next year, and LSST takes its first science images in 2021.
• Urgency: All cosmological measurements today are limited by systematics, not statistics. To reduce these
uncertainties and make the most of these future experiments, thousands, if not millions, of exascale-sized
simulations must be carried out. Surrogate models are a viable path forward to achieving this goal—only if their
limitations are fully understood.
• Benefit to ECP–Large DOE Experiments: Once demonstrated, this software framework can be easily
adapted to other fields and simulation areas.
11
CosmoGAN for Surrogates
• Cosmology simulations are extremely
computationally expensive:
– e.g., ‘Q-continuum’ HACC sim took ~2 weeks
– One product is weak lensing convergence maps to
compare to observations
• Seek to augment simulations with a learned
generative neural net
– Reproduce convergence maps–images
– Train using existing simulation maps with different
cosmological parameters
– Once trained produce new examples
– Use these maps to constrain observational systematics
quickly!
11
Mustafa Mustafa, Deborah Bard, Wahid Bhimji,
Rami Al-Rfou, Zarija Lukić
https://arxiv.org/abs/1706.02390
12
GANs and Cosmology
ΩΛ ΩM σ8…
Exascale in size:
• The surrogate model used are based on the Generative
Adversarial Networks (GAN) framework and currently are part of experiments with Variational
Auto-encoders. Both models will eventually need hyperparameter tuning across an exascale-
sized machine to extend to larger maps and build conditional generative models.
• For complex conditional models on large maps (20482 with the expectation to go to 80963), we
expect to use intra-node model parallelism (up to a few GPUs) and inter-node data parallelism
(10s-100s of nodes) for faster experimentation during algorithm development and final model
training. When such models’ hyperparameters are being tuned, they will require thousands of
nodes.
• After having been trained, when applied to generation, this GAN/VAE approach would be
coupled with large-scale cosmology applications running on exascale machines (such as the
ExaSky Apps HACC and Nyx).
• Future observational data sets will be multi-petabyte in size, and the simulation data are larger
by an order of magnitude or more.
13
• Science driver: Thin films of block-copolymer (BCP) cylinder-forming polystyrene-block-poly(methyl
methacrylate) are self-assembled during experiments at light sources. This is important to the understanding
and annealing of materials at the nanoscale.
• Our Use Case: Use reinforcement learning to develop a policy to guide the scientist to modify experimental
parameters (e.g., temperature) most likely to affect outcomes favorably with the goal of forming non-equilibrium
morphologies.
• Experiment: Grazing-Incidence Small-Angle X-ray Scattering (GISAXS) experiments can determine symmetry,
size, spacing, orientation, grain size, order/perfection, unit cell, and more.
• Synchrotron light sources are used for GISAXS experiments.
• Challenge: Accelerate self-assembly process with minimal use of experimental and computational resources.
• Approach: Train reinforcement learning model at supercomputing facility and deploy at user facility. Use
synthetic data for initial training and eventually incorporate experiment data into the model.
Machine Learning for Control: Directed Self-Assembly of Block Copolymers
14
• BCP cylinders go from poorly ordered
to hexagonal then to horizontally
ordered.
• The size, shape, and anisotropy of
grains are influenced by ordering
history; for instance, faster heating
rates reduce grain anisotropy (different
properties in different directions).
Cartoon, SEM and GISAXS images of BCP ordering history (from P. Majewski and K. Yager)
• Reinforcement learning system will first train on
simulated images that are converted to a
structure vector.
• Control parameters will be annealing
temperature and time, BCO composition, thin film
thickness, substrate surface chemistry, blending
ratio of different BCPs, and more
BCP Problem and Initial Reinforcement Learning Workflow Design
15
ExaLearn: Machine Learning for Inverse Problems in Materials Science
• Objective
– Experimental data acquisition (DAQ) technologies
• Diffraction
• Scattering
• Crystallography
• Other experimental procedures
– Current methods use forward models in refinement loops for determination of material structure and
properties
• Extremely time-consuming
• Speed, correctness, and quality of results depend on physics included in forward models
– ExaLearn will architect and deploy extreme-scale ML tools to enable fast, accurate determination of
materials structure directly from experimental data.
Expensive Loop Refinement Method
DAQ Machine
• Diffraction
• Scattering
• Crystallography
• …
Material
Structure
DescriptionLearning
ExaLearn
Approach
16
ExaLearn: Machine Learning for Inverse Problems in Materials Science
• Impact
• Enable learning from vast troves of materials-related empirical and
simulation data available within the DOE complex, as well as in
various repositories around the world quickly and more
accurately.
• Core ML technologies will be built based on domain-agnostic
design principles, but deployment will target domain-specific
benchmarks in neutron scattering and X-ray crystallography.
• Of particular interest are studies of an important class of materials
called perovskites (see chart).
• Point of Contact Sudip K. Seal (sealsk@ornl.gov, 865-574-8152)
Perovskite solar cells: An integrated hybrid lifecycle assessment and review in comparison with other photovoltaic technologies - Scientific Figure on ResearchGate.
Available from: https://www.researchgate.net/figure/Number-of-publications-resulting-from-the-search-of-perovskite-solar-cell-on-Web-of_fig3_317569861 [accessed 11
Jan, 2019]
Determine
Symmetry of
Structure
(Classification)
Instrument Information
Bragg Profile
Symmetry
Determine Internal
Parameters of
Structure
(Regression)
• Lattice parameters
• Atomic
coordinates
• Grain size & strain
• …
ExaLearn Pipeline for Material Structure Determination from Neutron Scattering/Diffraction Data
17
ExaLearn: Machine Learning for Design–Goal
• Find novel materials with beneficial properties.
• Combine simulation and reinforcement
learning at exascale to learn to intelligently
explore design space.
• Use simulation data to train generative
models to propose new materials for scientists
to test.
• Use Case 1: Electrolytes for improved
batteries and energy storage.
• Use Case 2: Catalysis for improved
chemical conversions and
transformations.
18
Scaling and Performance Cross-cut Area
• The scaling and performance cross-cut area’s goal is to identify or create
supporting ML/deep learning (DL) technologies and tools that integrate into the
software and infrastructure cross-cut area and serve the research needs for the
four application areas.
• The four application areas have diverse ML/DL requirements ranging from:
– Classification, regression, and generative DL models
– Reinforcement learning
– ML/DL-guided workflows driving predictive science applications
• Lightweight, in situ inference
• Continuous integration of new simulation output into trained model
– Creating fast surrogate models
19
Adapting Scalable ML/DL tools to ExaLearn Applications
• Application needs will be served by a combination of existing open-source
research tools, for example:
– Horovod plus TensorFlow and PyTorch
– LBANN
– OpenAI Gym
– CANDLE supervisor framework
Tool Data Parallelism Model Parallelism Ensemble Parallelism Environment parallelism
TensorFlow + Horovod X
PyTorch + Horovod X
LBANN Y Y Y (LTFB)
OpenAI Gym Y
CANDLE Sup. Y (PPBT) Y
20
● ODED: Intelligently steer a computational campaign
toward an objective. Determine “optimal” next
computing steps dynamically. A fundamental
dependence on adaptive workflows.
● ODED is a common functional requirement across all
four ExaLearn applications: Control, Design, Inverse,
and Surrogates. Also a recurring requirement in
Climate Modeling, Computational Drug Design
(CANDLE)
● Exemplar: ODED for Drug Design. Given a set of drug
candidates and computationally expensive protocols,
find optimal mix of protocols and parameters given a
constraint in HPC resource/allocation/time-to-solution.
● ExaLearn ODED Framework (Brookhaven Lead)
Objective-driven Experiment Design (ODED)
21
dlhub.org
22
ExaLearn ECP PathForward Vendor Engagement
• Leads: Michael Wolf (SNL) and James Ang (PNNL)
• Three major efforts
– Direct vendor engagement (Cray, Intel, HPE, AMD, IBM, NVIDIA). Discuss ExaLearn needs
with vendors, attend vendor “deep-dives,” and understand new technologies.
– ECP application and experimental facility engagement. Work with other seven ExaLearn
application and cross-cutting areas to develop proxy applications to reflect the system
utilization (computation, communication, etc.) of key ML kernels accurately in the context of
applications.
– ECP ML proxy applications. Work with ECP proxy application team to define ExaLearn proxy
applications and problem sets as part of ECP proxy application suite. Work with ECP
application development teams and PathForward vendors to ensure these are useful ML
drivers.
vendors an
projects
vendors and ECP
projects
Collaborate with DOE computing centers, HPC
vendors and ECP co-design and software technology
projects
15
vendors and ECP co-design and software technology
projects
15
23
CANDLE Project Provides Initial Take on Exascale
Learning Software Requirements
• Enable high productivity for DL-centric workflows
• Support Key DL frameworks on DOE supercomputers
(Keras, TF, Mxnet, CNTK)
• Support multiple paths to concurrency (Ensembles, Data
and Model Parallel)
• Manage training data, model search, scoring,
optimization, production training and inference (End-to-
End Workflow)
• CANDLE runtime/supervisor (interface with batch
schedulers)
• CANDLE Python library for improving model
development (UQ, HPO, CV, MV)
• Well-documented open examples and tutorials on
GitHub
• Leverage as much open source as possible (build only
what we need to add to existing frameworks)
ExaLearn will fill gaps, address new requirements, support other forms of learning,
address usability and scalability, and otherwise meet needs of new applications
exascaleproject.org
Thank you!

Más contenido relacionado

La actualidad más candente

La actualidad más candente (11)

Graphics card
Graphics cardGraphics card
Graphics card
 
Differnce of two processors
Differnce of two processorsDiffernce of two processors
Differnce of two processors
 
Innnovation and Futures Thinking - ISA16 - Cordoba
Innnovation and Futures Thinking - ISA16 - CordobaInnnovation and Futures Thinking - ISA16 - Cordoba
Innnovation and Futures Thinking - ISA16 - Cordoba
 
Stem cell therapy and organoid and 3D bioprinting
Stem cell therapy and organoid and 3D bioprintingStem cell therapy and organoid and 3D bioprinting
Stem cell therapy and organoid and 3D bioprinting
 
The Coca-Cola Company: BPM in the Context of Global ERP (Enterprise Resource ...
The Coca-Cola Company: BPM in the Context of Global ERP (Enterprise Resource ...The Coca-Cola Company: BPM in the Context of Global ERP (Enterprise Resource ...
The Coca-Cola Company: BPM in the Context of Global ERP (Enterprise Resource ...
 
Cerebral Edema
Cerebral EdemaCerebral Edema
Cerebral Edema
 
A fresh look at cell salvage
A fresh look at cell salvageA fresh look at cell salvage
A fresh look at cell salvage
 
Circulatory Assistance in Heart Failure
Circulatory Assistance in Heart Failure Circulatory Assistance in Heart Failure
Circulatory Assistance in Heart Failure
 
Brain tumors 2020 by Assist.Prof. Ines Strenja MD, PhD Clinical hospital R...
Brain tumors  2020 by Assist.Prof. Ines Strenja  MD, PhD  Clinical hospital R...Brain tumors  2020 by Assist.Prof. Ines Strenja  MD, PhD  Clinical hospital R...
Brain tumors 2020 by Assist.Prof. Ines Strenja MD, PhD Clinical hospital R...
 
Basics of cpb
Basics of cpbBasics of cpb
Basics of cpb
 
ECMO by DJ
ECMO by DJECMO by DJ
ECMO by DJ
 

Similar a ExaLearn Overview - ECP Co-Design Center for Machine Learning

Table of Contents
Table of ContentsTable of Contents
Table of Contents
butest
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...
butest
 

Similar a ExaLearn Overview - ECP Co-Design Center for Machine Learning (20)

Exascale Computing Project - Driving a HUGE Change in a Changing World
Exascale Computing Project - Driving a HUGE Change in a Changing WorldExascale Computing Project - Driving a HUGE Change in a Changing World
Exascale Computing Project - Driving a HUGE Change in a Changing World
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data
 
Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science research
 
Table of Contents
Table of ContentsTable of Contents
Table of Contents
 
Exascale Computing Project Update
Exascale Computing Project UpdateExascale Computing Project Update
Exascale Computing Project Update
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...
 
Confessions of an Interdisciplinary Researcher: The Case of High Performance ...
Confessions of an Interdisciplinary Researcher: The Case of High Performance ...Confessions of an Interdisciplinary Researcher: The Case of High Performance ...
Confessions of an Interdisciplinary Researcher: The Case of High Performance ...
 
Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...
 
AI for Science
AI for ScienceAI for Science
AI for Science
 
A Space X Industry Day Briefing 7 Jul08 Jgm R4
A Space X Industry Day Briefing 7 Jul08 Jgm R4A Space X Industry Day Briefing 7 Jul08 Jgm R4
A Space X Industry Day Briefing 7 Jul08 Jgm R4
 
Software and Education at NSF/ACI
Software and Education at NSF/ACISoftware and Education at NSF/ACI
Software and Education at NSF/ACI
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data sets
 
Exascale Computing Project (ECP) Update
Exascale Computing Project (ECP) UpdateExascale Computing Project (ECP) Update
Exascale Computing Project (ECP) Update
 
sample-resume
sample-resumesample-resume
sample-resume
 
Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 
Computational Thinking in the Workforce and Next Generation Science Standards...
Computational Thinking in the Workforce and Next Generation Science Standards...Computational Thinking in the Workforce and Next Generation Science Standards...
Computational Thinking in the Workforce and Next Generation Science Standards...
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Capella Days 2021 | How much time does modeling take? Experiences from modeli...
Capella Days 2021 | How much time does modeling take? Experiences from modeli...Capella Days 2021 | How much time does modeling take? Experiences from modeli...
Capella Days 2021 | How much time does modeling take? Experiences from modeli...
 

Más de inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
inside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
inside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 

Más de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

ExaLearn Overview - ECP Co-Design Center for Machine Learning

  • 1. exascaleproject.org ExaLearn Overview ECP’s Co-design Center for Exascale Machine Learning Technologies Project PI: Francis J. Alexander, BNL Partner PIs and Institutions: ▪ Ian Foster, ANL ▪ Peter Nugent, LBNL ▪ Brian Van Essen, LLNL ▪ Aric Hagberg, LANL ▪ David Womble, ORNL ▪ James A. Ang, PNNL ▪ Michael Wolf, SNL Place: HPC User Forum, Santa Fe, NM Date: April 2, 2019
  • 2. 2 Co-design is a Multidisciplinary Collaborative Process • In Co-design, HPC systems are developed with application software influencing hardware design trade-offs while also recognizing that applications and supporting software must be developed concurrently with hardware changes. • Under the status quo, HPC systems are developed without application input, and “advanced architectures” have a reputation of being “hard to program.” • Co-design encourages all parties, ranging from the system designers, computer architects, application software and tools developers, facilities, etc., to jointly design and optimize the system specifications via open communication and collaboration.
  • 3. 3 ECP’s Portfolio of Co-design Projects • CODAR: Co-design center for Online Data Analysis and Reduction • COPA: Co-design Center for Particle Applications • AMREX: Block-structured adaptive mesh refinement (AMR) Co-design Center • CEED: Center for Efficient Exascale Discretizations (CEED) • ExaGraph: Combinatorial Methods for Enabling Exascale Applications • ExaLearn: Co-design Center for Exascale Machine Learning Technologies
  • 4. 4 Three Overarching ExaLearn Co-design Questions • What are the best learning algorithms and methods for different application classes in terms of speed, accuracy, and resource requirements? How can we implement these algorithms to achieve scalability and performance portability? • What are the trade-offs in learning accuracy, resource needs, and overall application performance between using various learning methods? How do these trade-offs vary with exascale hardware and software choices? • How do we effectively orchestrate learning to reduce associated overheads? How can exascale hardware and software help with this orchestration?
  • 5. 5 Relationships between Machine Learning and HPC • HPC for Machine Learning: HPC Technologies are applied to learning tasks to accelerate computation and/or solve larger problems. • Machine Learning for HPC: Learning Technologies are applied to HPC computations to improve their performance in some way, e.g., by choosing the next simulation(s) to perform. HPC for MLML for HPC Ack: Gadi Singer
  • 6. 6 Application Priorities Determine Machine Learning Methods • Deep Learning (CNN, RNN, etc.) • Ensemble Methods and Random Forest Methods • Reinforcement Learning • Kernel Methods • Tensor Methods • Graph-Based learning
  • 7. 7 Overarching Goals for ExaLearn • Provide exascale machine learning software for use by: – ECP Applications Projects – Other ECP Co-design Centers – DOE Experimental Facilities – DOE Leadership Class Computing Facilities • ExaLearn will establish multidisciplinary collaborations in learning technologies that cross-cut individual ECP projects: – ECP AD projects that share an interest in a machine learning method – ECP ST projects – ECP HI/PathForward projects
  • 8. 8 Work Product is a Software Toolset that… • Applies to multiple problems within the DOE mission • Has a line-of-sight to exascale computing, e.g., by using exascale platforms directly or providing essential components to an exascale workflow • Does not replicate capabilities easily obtainable from existing, widely available packages • Builds in domain knowledge where possible (not often done by industry, although efforts beginning IBM, GE, etc.); “physics”-based ML and AI (recurring theme) • Quantifies Uncertainty in predictive capacity • Is interpretable • Is reproducible.
  • 9. 9 Application Evaluation Criteria • Would the work we propose directly benefit an ECP application/program, and/or an important non-ECP application, and/or DOE user facility? Impact could include facility design and control, design of experiment, etc. • Is it exascale in size? (a) i.e., tuning hyperparameters on individual nodes over the whole machine (pseudo-embarrassingly parallel) (b) i.e., deploying a complex network across the whole machine (c) does any aspect of the ML problem (either data size or optimization time, etc., require exascale) • Does it enable new science? What is the specific science question? • What ML methods do they employ right now? • Do they have examples in-hand on which we can benchmark at scale now? • Are there cross-cut benefits to other science areas? • What are the known performance considerations and/or tool barriers ? • What data are available, what is the quality of the data, how is it curated, and what are the risks associated with using that data? • What would the workflow look like (training/ensemble/optimization/etc.)? • Is there willingness/interest on the part of the application/facilities folks in working with ExaLearn ? • Is there any existing software code base/how will feedback work/can we make a meaningful impact/will it lead to future work like SciDAC?
  • 10. 10 ExaLearn for Surrogates–Making Realistic Simulations on the Cheap • Challenge and Importance: Many DOE simulation efforts could benefit from having realistic surrogate models in place of computationally expensive simulations. These can be used to quickly flesh out parameter space, help with real-time decision making, assist experimental design, and determine the best areas to perform additional simulations. We will first target large-scale structure simulations of the universe. As the field is well developed, the scale can easily be ramped up to an exascale ML challenge, and the field is robust enough to explore systematics at the sub-percent level. • ML impact: Neural-networks-based generative models can make reliable surrogate models of expensive simulations for data augmentation purposes. Such surrogate models can be used to aid in cosmological analysis to reduce systematic uncertainties in observations. • Timeliness: The ExaSky Application projects is producing the largest LSS simulations now. The DESI experiment starts next year, and LSST takes its first science images in 2021. • Urgency: All cosmological measurements today are limited by systematics, not statistics. To reduce these uncertainties and make the most of these future experiments, thousands, if not millions, of exascale-sized simulations must be carried out. Surrogate models are a viable path forward to achieving this goal—only if their limitations are fully understood. • Benefit to ECP–Large DOE Experiments: Once demonstrated, this software framework can be easily adapted to other fields and simulation areas.
  • 11. 11 CosmoGAN for Surrogates • Cosmology simulations are extremely computationally expensive: – e.g., ‘Q-continuum’ HACC sim took ~2 weeks – One product is weak lensing convergence maps to compare to observations • Seek to augment simulations with a learned generative neural net – Reproduce convergence maps–images – Train using existing simulation maps with different cosmological parameters – Once trained produce new examples – Use these maps to constrain observational systematics quickly! 11 Mustafa Mustafa, Deborah Bard, Wahid Bhimji, Rami Al-Rfou, Zarija Lukić https://arxiv.org/abs/1706.02390
  • 12. 12 GANs and Cosmology ΩΛ ΩM σ8… Exascale in size: • The surrogate model used are based on the Generative Adversarial Networks (GAN) framework and currently are part of experiments with Variational Auto-encoders. Both models will eventually need hyperparameter tuning across an exascale- sized machine to extend to larger maps and build conditional generative models. • For complex conditional models on large maps (20482 with the expectation to go to 80963), we expect to use intra-node model parallelism (up to a few GPUs) and inter-node data parallelism (10s-100s of nodes) for faster experimentation during algorithm development and final model training. When such models’ hyperparameters are being tuned, they will require thousands of nodes. • After having been trained, when applied to generation, this GAN/VAE approach would be coupled with large-scale cosmology applications running on exascale machines (such as the ExaSky Apps HACC and Nyx). • Future observational data sets will be multi-petabyte in size, and the simulation data are larger by an order of magnitude or more.
  • 13. 13 • Science driver: Thin films of block-copolymer (BCP) cylinder-forming polystyrene-block-poly(methyl methacrylate) are self-assembled during experiments at light sources. This is important to the understanding and annealing of materials at the nanoscale. • Our Use Case: Use reinforcement learning to develop a policy to guide the scientist to modify experimental parameters (e.g., temperature) most likely to affect outcomes favorably with the goal of forming non-equilibrium morphologies. • Experiment: Grazing-Incidence Small-Angle X-ray Scattering (GISAXS) experiments can determine symmetry, size, spacing, orientation, grain size, order/perfection, unit cell, and more. • Synchrotron light sources are used for GISAXS experiments. • Challenge: Accelerate self-assembly process with minimal use of experimental and computational resources. • Approach: Train reinforcement learning model at supercomputing facility and deploy at user facility. Use synthetic data for initial training and eventually incorporate experiment data into the model. Machine Learning for Control: Directed Self-Assembly of Block Copolymers
  • 14. 14 • BCP cylinders go from poorly ordered to hexagonal then to horizontally ordered. • The size, shape, and anisotropy of grains are influenced by ordering history; for instance, faster heating rates reduce grain anisotropy (different properties in different directions). Cartoon, SEM and GISAXS images of BCP ordering history (from P. Majewski and K. Yager) • Reinforcement learning system will first train on simulated images that are converted to a structure vector. • Control parameters will be annealing temperature and time, BCO composition, thin film thickness, substrate surface chemistry, blending ratio of different BCPs, and more BCP Problem and Initial Reinforcement Learning Workflow Design
  • 15. 15 ExaLearn: Machine Learning for Inverse Problems in Materials Science • Objective – Experimental data acquisition (DAQ) technologies • Diffraction • Scattering • Crystallography • Other experimental procedures – Current methods use forward models in refinement loops for determination of material structure and properties • Extremely time-consuming • Speed, correctness, and quality of results depend on physics included in forward models – ExaLearn will architect and deploy extreme-scale ML tools to enable fast, accurate determination of materials structure directly from experimental data. Expensive Loop Refinement Method DAQ Machine • Diffraction • Scattering • Crystallography • … Material Structure DescriptionLearning ExaLearn Approach
  • 16. 16 ExaLearn: Machine Learning for Inverse Problems in Materials Science • Impact • Enable learning from vast troves of materials-related empirical and simulation data available within the DOE complex, as well as in various repositories around the world quickly and more accurately. • Core ML technologies will be built based on domain-agnostic design principles, but deployment will target domain-specific benchmarks in neutron scattering and X-ray crystallography. • Of particular interest are studies of an important class of materials called perovskites (see chart). • Point of Contact Sudip K. Seal (sealsk@ornl.gov, 865-574-8152) Perovskite solar cells: An integrated hybrid lifecycle assessment and review in comparison with other photovoltaic technologies - Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Number-of-publications-resulting-from-the-search-of-perovskite-solar-cell-on-Web-of_fig3_317569861 [accessed 11 Jan, 2019] Determine Symmetry of Structure (Classification) Instrument Information Bragg Profile Symmetry Determine Internal Parameters of Structure (Regression) • Lattice parameters • Atomic coordinates • Grain size & strain • … ExaLearn Pipeline for Material Structure Determination from Neutron Scattering/Diffraction Data
  • 17. 17 ExaLearn: Machine Learning for Design–Goal • Find novel materials with beneficial properties. • Combine simulation and reinforcement learning at exascale to learn to intelligently explore design space. • Use simulation data to train generative models to propose new materials for scientists to test. • Use Case 1: Electrolytes for improved batteries and energy storage. • Use Case 2: Catalysis for improved chemical conversions and transformations.
  • 18. 18 Scaling and Performance Cross-cut Area • The scaling and performance cross-cut area’s goal is to identify or create supporting ML/deep learning (DL) technologies and tools that integrate into the software and infrastructure cross-cut area and serve the research needs for the four application areas. • The four application areas have diverse ML/DL requirements ranging from: – Classification, regression, and generative DL models – Reinforcement learning – ML/DL-guided workflows driving predictive science applications • Lightweight, in situ inference • Continuous integration of new simulation output into trained model – Creating fast surrogate models
  • 19. 19 Adapting Scalable ML/DL tools to ExaLearn Applications • Application needs will be served by a combination of existing open-source research tools, for example: – Horovod plus TensorFlow and PyTorch – LBANN – OpenAI Gym – CANDLE supervisor framework Tool Data Parallelism Model Parallelism Ensemble Parallelism Environment parallelism TensorFlow + Horovod X PyTorch + Horovod X LBANN Y Y Y (LTFB) OpenAI Gym Y CANDLE Sup. Y (PPBT) Y
  • 20. 20 ● ODED: Intelligently steer a computational campaign toward an objective. Determine “optimal” next computing steps dynamically. A fundamental dependence on adaptive workflows. ● ODED is a common functional requirement across all four ExaLearn applications: Control, Design, Inverse, and Surrogates. Also a recurring requirement in Climate Modeling, Computational Drug Design (CANDLE) ● Exemplar: ODED for Drug Design. Given a set of drug candidates and computationally expensive protocols, find optimal mix of protocols and parameters given a constraint in HPC resource/allocation/time-to-solution. ● ExaLearn ODED Framework (Brookhaven Lead) Objective-driven Experiment Design (ODED)
  • 22. 22 ExaLearn ECP PathForward Vendor Engagement • Leads: Michael Wolf (SNL) and James Ang (PNNL) • Three major efforts – Direct vendor engagement (Cray, Intel, HPE, AMD, IBM, NVIDIA). Discuss ExaLearn needs with vendors, attend vendor “deep-dives,” and understand new technologies. – ECP application and experimental facility engagement. Work with other seven ExaLearn application and cross-cutting areas to develop proxy applications to reflect the system utilization (computation, communication, etc.) of key ML kernels accurately in the context of applications. – ECP ML proxy applications. Work with ECP proxy application team to define ExaLearn proxy applications and problem sets as part of ECP proxy application suite. Work with ECP application development teams and PathForward vendors to ensure these are useful ML drivers. vendors an projects vendors and ECP projects Collaborate with DOE computing centers, HPC vendors and ECP co-design and software technology projects 15 vendors and ECP co-design and software technology projects 15
  • 23. 23 CANDLE Project Provides Initial Take on Exascale Learning Software Requirements • Enable high productivity for DL-centric workflows • Support Key DL frameworks on DOE supercomputers (Keras, TF, Mxnet, CNTK) • Support multiple paths to concurrency (Ensembles, Data and Model Parallel) • Manage training data, model search, scoring, optimization, production training and inference (End-to- End Workflow) • CANDLE runtime/supervisor (interface with batch schedulers) • CANDLE Python library for improving model development (UQ, HPO, CV, MV) • Well-documented open examples and tutorials on GitHub • Leverage as much open source as possible (build only what we need to add to existing frameworks) ExaLearn will fill gaps, address new requirements, support other forms of learning, address usability and scalability, and otherwise meet needs of new applications