HPCMPUG2011 cray tutorial
Maximizing Application Performance on the Cray XE6
Jeff Larkin
1.
[title slide]
2.
Review of:
- XT6 Architecture: AMD Opteron, Cray Networks, Lustre Basics
- Programming Environment: PGI Compiler Basics, The Cray Compiler Environment, Cray Scientific Libraries, Cray Message Passing Toolkit, Cray Performance Analysis Tools, ATP, CCM
- Optimizations: CPU, Communication, I/O
3.
AMD CPU Architecture
Cray Architecture
Lustre Filesystem Basics
4.
[figure]
5.
|                            | 2003 AMD Opteron | 2005 AMD Opteron | 2007 "Barcelona" | 2008 "Shanghai" | 2009 "Istanbul" | 2010 "Magny-Cours" |
| Mfg. Process               | 130nm SOI        | 90nm SOI         | 65nm SOI         | 45nm SOI        | 45nm SOI        | 45nm SOI           |
| CPU Core                   | K8               | K8               | Greyhound        | Greyhound+      | Greyhound+      | Greyhound+         |
| L2/L3                      | 1MB/0            | 1MB/0            | 512kB/2MB        | 512kB/6MB       | 512kB/6MB       | 512kB/12MB         |
| HyperTransport™ Technology | 3x 1.6GT/s       | 3x 1.6GT/s       | 3x 2GT/s         | 3x 4.0GT/s      | 3x 4.8GT/s      | 4x 6.4GT/s         |
| Memory                     | 2x DDR1 300      | 2x DDR1 400      | 2x DDR2 667      | 2x DDR2 800     | 2x DDR2 800     | 4x DDR3 1333       |
6.
- 12 cores, 1.7-2.2 GHz: 105.6 Gflops
- 8 cores, 1.8-2.4 GHz: 76.8 Gflops
- Power (ACP): 80 Watts
- Stream: 27.5 GB/s
- Cache: 12x 64KB L1, 12x 512KB L2, 12MB L3
7.
[figure: Magny-Cours die layout - cores 0-11 with private L2 caches, shared L3 cache, memory controllers, HT links]
8.
- A cache line is 64B
- Unique L1 and L2 cache attached to each core: L1 cache is 64 kbytes, L2 cache is 512 kbytes
- L3 cache is shared between 6 cores
- Cache is a "victim cache": all loads go to L1 immediately and get evicted down the caches
- Hardware prefetcher detects forward and backward strides through memory
- Each core can perform a 128b add and 128b multiply per clock cycle; this requires SSE, packed instructions ("stride-one vectorization")
- 6 cores share a "flat" memory; non-uniform memory access (NUMA) beyond a node
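To make the "stride-one vectorization" point concrete, here is a small illustration (mine, not the slides'): a unit-stride loop lets the compiler emit packed 128-bit SSE operations and keeps the hardware prefetcher engaged, while a large stride defeats both. Names and sizes are illustrative.

    #include <stddef.h>

    #define N 1048576

    /* Unit-stride loop: eligible for packed SSE (two doubles per
       128-bit add/multiply) and easy for the prefetcher to track. */
    void axpy_stride1(double a, const double *x, double *y) {
        for (size_t i = 0; i < N; i++)
            y[i] += a * x[i];
    }

    /* Large stride: each iteration touches a new cache line, so the
       loop runs at memory latency rather than at SSE throughput. */
    void axpy_stride8(double a, const double *x, double *y) {
        for (size_t i = 0; i < N; i += 8)
            y[i] += a * x[i];
    }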
9.
| Processor      | Frequency (GHz) | Peak (Gflops) | Bandwidth (GB/sec) | Balance (bytes/flop) |
| Istanbul (XT5) | 2.6             | 62.4          | 12.8               | 0.21                 |
| MC-8           | 2.0             | 64            | 42.6               | 0.67                 |
| MC-8           | 2.3             | 73.6          | 42.6               | 0.58                 |
| MC-8           | 2.4             | 76.8          | 42.6               | 0.55                 |
| MC-12          | 1.9             | 91.2          | 42.6               | 0.47                 |
| MC-12          | 2.1             | 100.8         | 42.6               | 0.42                 |
| MC-12          | 2.2             | 105.6         | 42.6               | 0.40                 |
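As a consistency check on the table (my arithmetic, not the slide's): peak is cores × 4 flops/clock × frequency, so MC-12 at 2.2 GHz gives 12 × 4 × 2.2 = 105.6 Gflops, and balance is bandwidth over peak, 42.6 / 105.6 ≈ 0.40 bytes/flop.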
10.
Gemini (XE-series)
11.
- Microkernel on Compute PEs, full-featured Linux on Service PEs
- Service PEs specialize by function: Login, Network, System, I/O (the service partition is made of specialized Linux nodes)
- Eliminates OS "jitter", which enables reproducible run times
- Large machines boot in under 30 minutes, including the filesystem
12.
[figure: XE6 system - external login server, boot RAID, 10 GbE, IB QDR]
13.
Node characteristics (Cray SeaStar2+ interconnect; 6.4 GB/sec direct-connect HyperTransport; 83.5 GB/sec direct-connect memory):

| Characteristic                | Value                  |
| Number of Cores               | 16 or 24 (MC), 32 (IL) |
| Peak Performance, MC-8 (2.4)  | 153 Gflops/sec         |
| Peak Performance, MC-12 (2.2) | 211 Gflops/sec         |
| Memory Size                   | 32 or 64 GB per node   |
| Memory Bandwidth              | 83.5 GB/sec            |
14.
[figure: Magny-Cours node - Greyhound cores, DDR3 channels, 6MB L3 caches, HT3 links]
- 2 Multi-Chip Modules, 4 Opteron dies
- 8 channels of DDR3 bandwidth to 8 DIMMs
- 24 (or 16) computational cores, 24 MB of L3 cache
- Dies are fully connected with HT3
- Snoop Filter feature allows the 4-die SMP to scale well
15.
Without the snoop filter, a streams test shows 25 GB/sec out of a possible 51.2 GB/sec, or 48% of peak bandwidth.
16.
With the snoop filter, a streams test shows 42.3 GB/sec out of a possible 51.2 GB/sec, or 82% of peak bandwidth. This feature will be key for two-socket Magny-Cours nodes, which are architecturally the same.
17.
- New compute blade with 8 AMD Magny-Cours processors
- Plug-compatible with XT5 cabinets and backplanes
- Upgradeable to AMD's "Interlagos" series
- XE6 systems ship with the current SIO blade
18.
[figure]
19.
Gemini:
- Supports 2 nodes per ASIC; 168 GB/sec routing capacity
- Scales to over 100,000 network endpoints
- Link-level reliability and adaptive routing; advanced resiliency features
- Provides global address space
- Advanced NIC designed to efficiently support MPI, one-sided MPI, Shmem, UPC, and Coarray Fortran
[figure: Gemini block diagram - two HyperTransport 3 links, NIC 0 and NIC 1, Netlink block, LO processor, 48-port YARC router]
20.
Cray Baker node characteristics:

| Characteristic   | Value                |
| Number of Cores  | 16 or 24             |
| Peak Performance | 140 or 210 Gflops/s  |
| Memory Size      | 32 or 64 GB per node |
| Memory Bandwidth | 85 GB/sec            |

10 12X Gemini channels (each Gemini acts like two nodes on the 3D torus); high-radix YARC router with adaptive routing; 168 GB/sec capacity.
21.
[figure: module with SeaStar vs. module with Gemini on the X/Y/Z torus]
22.
[figure: Gemini NIC block diagram - FMA, BTE, CQ, NAT, AMO, RMT, RAT, NPT blocks, HT3 Cave, router tiles]
- FMA (Fast Memory Access): mechanism for most MPI transfers; supports tens of millions of MPI requests per second
- BTE (Block Transfer Engine): supports asynchronous block transfers between local and remote memory, in either direction; for use for large MPI transfers that happen in the background
23.
- Two Gemini ASICs are packaged on a pin-compatible mezzanine card
- Topology is a 3-D torus; each lane of the torus is composed of 4 Gemini router "tiles"
- Systems with SeaStar interconnects can be upgraded by swapping this card
- 100% of the 48 router tiles on each Gemini chip are used
24.
[figure]
25.
| Name     | Architecture | Processor                  | Network     | # Cores | Memory/Core                             |
| Jade     | XT-4         | AMD Budapest (2.1 GHz)     | SeaStar 2.1 | 8584    | 2GB DDR2-800                            |
| Einstein | XT-5         | AMD Shanghai (2.4 GHz)     | SeaStar 2.1 | 12827   | 2GB (some nodes have 4GB/core) DDR2-800 |
| MRAP     | XT-5         | AMD Barcelona (2.3 GHz)    | SeaStar 2.1 | 10400   | 4GB DDR2-800                            |
| Garnet   | XE-6         | Magny-Cours 8-core 2.4 GHz | Gemini 1.0  | 20160   | 2GB DDR3-1333                           |
| Raptor   | XE-6         | Magny-Cours 8-core 2.4 GHz | Gemini 1.0  | 43712   | 2GB DDR3-1333                           |
| Chugach  | XE-6         | Magny-Cours 8-core 2.3 GHz | Gemini 1.0  | 11648   | 2GB DDR3-1333                           |
26.
[figure]
27.
[figure]
28.
[figure]
29.
[figure: cabinet airflow - low-velocity and high-velocity airflow paths]
30.
- Cool air is released into the computer room; liquid in, liquid/vapor mixture out
- The hot air stream passes through the evaporator and rejects heat to R134a via liquid-vapor phase change (evaporation)
- R134a absorbs energy only in the presence of heated air
- Phase change is 10x more efficient than pure water cooling
31.
[figure: R134a piping - inlet evaporator and exit evaporators]
32.
[figure]
33.
| Term         | Meaning                      | Purpose                                                                                     |
| MDS          | Metadata Server              | Manages all file metadata for the filesystem. 1 per FS.                                     |
| OST          | Object Storage Target        | The basic "chunk" of data written to disk. Max 160 per file.                                |
| OSS          | Object Storage Server        | Communicates with disks, manages 1 or more OSTs. 1 or more per FS.                          |
| Stripe Size  | Size of chunks               | Controls the size of file chunks stored to OSTs. Can't be changed once the file is written. |
| Stripe Count | Number of OSTs used per file | Controls parallelism of the file. Can't be changed once the file is written.                |
34.
[figure]
35.
[figure]
36.
32 MB per OST (32 MB - 5 GB) and 32 MB transfer size:
- Unable to take advantage of file system parallelism
- Access to multiple disks adds overhead, which hurts performance
[chart: Single Writer Write Performance - Write (MB/s) vs. stripe count (1-160) for 1 MB and 32 MB stripes]
37.
Single OST, 256 MB file size:
- Performance can be limited by the process (transfer size) or by the file system (stripe size)
[chart: Single Writer Transfer vs. Stripe Size - Write (MB/s) vs. stripe size (1-128 MB) for 32 MB, 8 MB, and 1 MB transfers]
38.
Use the lfs command, libLUT, or MPI-IO hints to adjust your stripe count and possibly size:
  lfs setstripe -c -1 -s 4M <file or directory>    (160 OSTs, 4 MB stripe)
  lfs setstripe -c 1 -s 16M <file or directory>    (1 OST, 16 MB stripe)
  export MPICH_MPIIO_HINTS='*:striping_factor=160'
Files inherit striping information from the parent directory; this cannot be changed once the file is written. Set the striping before copying in files.
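The same striping hints can also be passed programmatically through the standard MPI-IO info mechanism. A minimal sketch (the filename and hint values are illustrative; striping_factor and striping_unit are the usual ROMIO hint keys, though support can vary by MPI-IO implementation):

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_File fh;
        MPI_Info info;

        MPI_Init(&argc, &argv);

        /* Hints must be attached before the file is created. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "160");    /* stripe count   */
        MPI_Info_set(info, "striping_unit", "4194304");  /* 4 MB stripe    */

        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* ... collective writes ... */

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }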
39.
Available Compilers
Cray Scientific Libraries
Cray Message Passing Toolkit
40.
Cray XT/XE supercomputers come with compiler wrappers to simplify building parallel applications (similar to mpicc/mpif90):
- Fortran compiler: ftn
- C compiler: cc
- C++ compiler: CC
Using these wrappers ensures that your code is built for the compute nodes and linked against important libraries:
- Cray MPT (MPI, Shmem, etc.)
- Cray LibSci (BLAS, LAPACK, etc.)
- ...
Choose the underlying compiler via the PrgEnv-* modules; do not call the PGI, Cray, etc. compilers directly. Always load the appropriate xtpe-<arch> module for your machine: it enables the proper compiler target and links optimized math libraries.
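As a quick sanity check of the wrapper toolchain, a minimal MPI program (my example, not the tutorial's) can be built with the wrapper, e.g. "cc hello.c -o hello" under any loaded PrgEnv-* module; the wrapper supplies the MPI headers and the Cray MPT link line automatically:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* One line per PE confirms the job launched where expected. */
        printf("PE %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }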
41.
…from Cray’s Perspective
PGI - very good Fortran and C, pretty good C++
- Good vectorization
- Good functional correctness with optimization enabled
- Good manual and automatic prefetch capabilities
- Very interested in the Linux HPC market, although that is not their only focus
- Excellent working relationship with Cray, good bug responsiveness

Pathscale - good Fortran, C, possibly good C++
- Outstanding scalar optimization for loops that do not vectorize
- Fortran front end uses an older version of the CCE Fortran front end
- OpenMP uses a non-pthreads approach
- Scalar benefits will not get as much mileage with longer vectors

Intel - good Fortran, excellent C and C++ (if you ignore vectorization)
- Automatic vectorization capabilities are modest, compared to PGI and CCE
- Use of inline assembly is encouraged
- Focus is more on best speed for scalar, non-scaling apps
- Tuned for Intel architectures, but actually works well for some applications on AMD
42.
…from Cray’s Perspective
GNU - so-so Fortran, outstanding C and C++ (if you ignore vectorization)
- Obviously, the best for gcc compatibility
- Scalar optimizer was recently rewritten and is very good
- Vectorization capabilities focus mostly on inline assembly
- Note the last three releases have been incompatible with each other (4.3, 4.4, and 4.5) and required recompilation of Fortran modules

CCE - outstanding Fortran, very good C, and okay C++
- Very good vectorization
- Very good Fortran language support; only real choice for Coarrays
- C support is quite good, with UPC support
- Very good scalar optimization and automatic parallelization
- Clean implementation of OpenMP 3.0, with tasks
- Sole delivery focus is on Linux-based Cray hardware systems
- Best bug turnaround time (if it isn't, let us know!)
- Cleanest integration with other Cray tools (performance tools, debuggers, upcoming productivity tools)
- No inline assembly support
43.
PGI: -fast -Mipa=fast(,safe)
- If you can be flexible with precision, also try -Mfprelaxed
- Compiler feedback: -Minfo=all -Mneginfo
- man pgf90; man pgcc; man pgCC; or pgf90 -help

Cray: <none, turned on by default>
- Compiler feedback: -rm (Fortran), -hlist=m (C)
- If you know you don't want OpenMP: -xomp or -Othread0
- man crayftn; man craycc; man crayCC

Pathscale: -Ofast
- Note: this is a little looser with precision than other compilers
- Compiler feedback: -LNO:simd_verbose=ON
- man eko ("Every Known Optimization")

GNU: -O2 / -O3
- Compiler feedback: good luck
- man gfortran; man gcc; man g++

Intel: -fast
- Compiler feedback: man ifort; man icc; man iCC
44.
[figure]
45.
Traditional (scalar) optimizations are controlled via -O# compiler flags; default: -O2.
More aggressive optimizations (including vectorization) are enabled with the -fast or -fastsse metaflags. These translate to:
  -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre
Interprocedural analysis allows the compiler to perform whole-program optimizations; this is enabled with -Mipa=fast.
See man pgf90, man pgcc, or man pgCC for more information about compiler options.
46.
Compiler feedback is enabled with -Minfo and -Mneginfo. This can provide valuable information about what optimizations were or were not done, and why.
To debug an optimized code, the -gopt flag will insert debugging information without disabling optimizations.
It's possible to disable optimizations included with -fast if you believe one is causing problems. For example, -fast -Mnolre enables -fast and then disables loop redundant optimizations.
To get more information about any compiler flag, add -help with the flag in question; pgf90 -help -fast will give more information about the -fast flag.
OpenMP is enabled with the -mp flag.
47.
Some compiler options may affect both performance and accuracy. Lower accuracy often means higher performance, but the compiler can also be made to enforce accuracy:
- -Kieee: all FP math strictly conforms to IEEE 754 (off by default)
- -Ktrap: turns on processor trapping of FP exceptions
- -Mdaz: treat all denormalized numbers as zero
- -Mflushz: set SSE to flush-to-zero (on with -fast)
- -Mfprelaxed: allow the compiler to use relaxed (reduced) precision to speed up some floating point optimizations
Some other compilers turn this on by default; PGI chooses to favor accuracy over speed by default.
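A toy illustration (mine, not the slides') of why these flags matter: floating-point addition is not associative, so any option that lets the compiler reorder a reduction can change results in the low bits. Even without special flags, the two summation orders below disagree:

    #include <stdio.h>

    int main(void) {
        /* Summing a large value with many small ones: the result depends
           on evaluation order, which relaxed-precision flags may change. */
        double big = 1.0e16, small = 1.0;
        double forward = big, backward = 0.0;
        int i;

        for (i = 0; i < 1000; i++) forward += small;  /* smalls absorbed  */
        for (i = 0; i < 1000; i++) backward += small; /* smalls survive   */
        backward += big;

        printf("forward  = %.1f\n", forward);   /* 1.0e16 exactly        */
        printf("backward = %.1f\n", backward);  /* 1.0e16 + 1000         */
        return 0;
    }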
48.
[figure]
49.
Cray has a long tradition of high performance compilers on Cray platforms (traditional vector, T3E, X1, X2):
- Vectorization
- Parallelization
- Code transformation
- More...
Investigated leveraging an open source compiler called LLVM. First release: December 2008.
50.
[figure: CCE compiler architecture - Fortran source and C/C++ source feed a Cray-developed Fortran front end and a C & C++ front end supplied by Edison Design Group (with Cray-developed code for extensions and interface support); interprocedural analysis plus compiler optimization and parallelization (Cray Inc. compiler technology); X86 and Cray X2 code generators produce the object file, with X86 code generation from open source LLVM plus additional Cray-developed optimizations and interface support]
51.
Standard-conforming languages and programming models:
- Fortran 2003
- UPC & CoArray Fortran: fully optimized and integrated into the compiler; no preprocessor involved; targets the network appropriately (GASNet with Portals, DMAPP with Gemini & Aries)
- Ability and motivation to provide high-quality support for custom Cray network hardware
Cray technology focused on scientific applications:
- Takes advantage of Cray's extensive knowledge of automatic vectorization and automatic shared memory parallelization
- Supplements, rather than replaces, the available compiler choices
52.
Make sure it is available: module avail PrgEnv-cray
To access the Cray compiler: module load PrgEnv-cray
To target the various chips: module load xtpe-[barcelona,shanghai,mc8]
Once you have loaded the module, "cc" and "ftn" are the Cray compilers.
Recommend just using default options. Use -rm (Fortran) and -hlist=m (C) to find out what happened.
man crayftn
53.
Excellent vectorization: vectorizes more loops than other compilers
OpenMP 3.0: tasks and nesting
PGAS: functional UPC and CAF available today
C++ support
Automatic parallelization:
- Modernized version of the Cray X1 streaming capability
- Interacts with OMP directives
Cache optimizations:
- Automatic blocking
- Automatic management of what stays in cache
- Prefetching, interchange, fusion, and much more...
54.
Loop-based optimizations: vectorization, OpenMP, autothreading, interchange, pattern matching, cache blocking / non-temporal / prefetching
Fortran 2003 standard; working on 2008
PGAS (UPC and Co-Array Fortran); some performance optimizations available in 7.1
Optimization feedback: loopmark
55.
Cray compiler supports a full and growing set of directives and pragmas:
  !dir$ concurrent
  !dir$ ivdep
  !dir$ interchange
  !dir$ unroll
  !dir$ loop_info [max_trips] [cache_na]
  !dir$ blockable
  ... many more
man directives; man loop_info
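For C code the analogous hints are spelled as pragmas. A sketch of the common case, assuming Cray C's "#pragma _CRI ivdep" spelling (check the craycc man pages on your system): asserting that a loop carries no dependence so the compiler may vectorize it even though the pointers might alias.

    /* Without help the compiler must assume a and b may alias and will
       not vectorize.  ivdep asserts there is no loop-carried dependence. */
    void scale(int n, double *a, const double *b) {
    #pragma _CRI ivdep
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * b[i];
    }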
56.
Compiler can generate a filename.lst file, containing an annotated listing of your source code with letters indicating important optimizations.

%%% Loopmark Legend %%%
Primary Loop Type: a - vector atomic memory operation; b - blocked; f - fused; i - interchanged; m - streamed but not partitioned; p - conditional, partial and/or computed; r - unrolled; s - shortloop; V - Vectorized; w - unwound
Modifiers: A - Pattern matched; C - Collapsed; D - Deleted; E - Cloned; I - Inlined; M - Multithreaded; P - Parallel/Tasked; t - array syntax temp used; W - Unwound
57.
• ftn -rm ... or cc -hlist=m ...

    29.  b-------<   do i3=2,n3-1
    30.  b b-----<     do i2=2,n2-1
    31.  b b Vr--<       do i1=1,n1
    32.  b b Vr            u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
    33.  b b Vr     >             + u(i1,i2,i3-1) + u(i1,i2,i3+1)
    34.  b b Vr            u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
    35.  b b Vr     >             + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
    36.  b b Vr-->       enddo
    37.  b b Vr--<       do i1=2,n1-1
    38.  b b Vr            r(i1,i2,i3) = v(i1,i2,i3)
    39.  b b Vr     >                  - a(0) * u(i1,i2,i3)
    40.  b b Vr     >                  - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
    41.  b b Vr     >                  - a(3) * ( u2(i1-1) + u2(i1+1) )
    42.  b b Vr-->       enddo
    43.  b b----->     enddo
    44.  b------->   enddo
58.
ftn-6289 ftn: VECTOR File = resid.f, Line = 29
  A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
  A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
  A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
  A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
  A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
  A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
  A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
  A loop starting at line 37 was vectorized.
59.
-hbyteswapio
- Link-time option
- Applies to all unformatted Fortran I/O
assign command: with the PrgEnv-cray module loaded, do this:
  setenv FILENV assign.txt
  assign -N swap_endian g:su
  assign -N swap_endian g:du
Can use assign to be more precise.
60.
OpenMP is ON by default:
- Optimizations controlled by -Othread#
- To shut it off, use -Othread0, -xomp, or -hnoomp
Autothreading is NOT on by default; -hautothread to turn on:
- Modernized version of the Cray X1 streaming capability
- Interacts with OMP directives
If you do not want to use OpenMP and have OMP directives in the code, make sure to make a run with OpenMP shut off at compile time.
61.
[figure]
62.
Cray has historically played a role in scientific library development: BLAS3 were largely designed for Crays, and standard libraries were tuned for Cray vector processors (later COTS). Cray has always tuned standard libraries for the Cray interconnect.
In the 90s, Cray provided many non-standard libraries: sparse direct, sparse iterative.
These days the goal is to remain portable (standard APIs) while providing more performance: advanced features, tuning knobs, environment variables.
63.
| Dense     | FFT      | Sparse   |
| BLAS      | CRAFFT   | CASK     |
| LAPACK    | FFTW     | PETSc    |
| ScaLAPACK | P-CRAFFT | Trilinos |
| IRT       |          |          |
| CASE      |          |          |

IRT - Iterative Refinement Toolkit; CASK - Cray Adaptive Sparse Kernels; CRAFFT - Cray Adaptive FFT; CASE - Cray Adaptive Simple Eigensolver
64.
There are many libsci libraries on the systems, one for each of:
- Compiler (Intel, Cray, GNU, Pathscale, PGI)
- Single thread, multiple thread
- Target (Istanbul, MC12)
Best way to use libsci is to ignore all of this:
- Load the xtpe module (some sites set this by default), e.g. module load xtpe-shanghai / xtpe-istanbul / xtpe-mc8
- Cray's drivers will link the library automatically
PETSc, Trilinos, FFTW, and ACML all have their own modules.
Tip: make sure you have the correct library loaded, e.g. -Wl,-ydgemm_
65.
Perhaps you want to link another library, such as ACML. This can be done:
- If the library is provided by Cray, then load the module; the link will be performed with the libraries in the correct order.
- If the library is not provided by Cray and has no module, add it to the link line; items you add to the explicit link will be in the correct place.
Note, to get explicit BLAS from ACML but ScaLAPACK from libsci: load the acml module. Explicit calls to BLAS in the code resolve from ACML; BLAS calls from the ScaLAPACK code will be resolved from libsci (no way around this).
66.
Threading capabilities in previous libsci versions were poor:
- Used PTHREADS (more explicit affinity etc.)
- Required explicit linking to a _mp version of libsci
- Was a source of concern for some applications that need hybrid performance and interoperability with OpenMP
LibSci 10.4.2 (February 2010): OpenMP-aware LibSci
- Allows calling of BLAS inside or outside a parallel region
- Single library supported (there is still a single-thread lib)
- Usage: load the xtpe module for your system (mc12)
- GOTO_NUM_THREADS outmoded; use OMP_NUM_THREADS
67.
Allows seamless calling of the BLAS within or without a parallel region, e.g. with OMP_NUM_THREADS = 12:

  call dgemm(...)        ! threaded dgemm is used with 12 threads

  !$OMP PARALLEL DO
  do
    call dgemm(...)      ! single-thread dgemm is used
  end do

Some users are requesting a further layer of parallelism here (see later).
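The same behavior from C, as a sketch (my example; dgemm_ is the standard Fortran BLAS entry point as seen from C, and the inside/outside-parallel-region behavior is LibSci's, as described above):

    #include <stddef.h>

    /* Fortran BLAS entry point: all arguments are passed by reference. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    void demo(int n, const double *a, const double *b,
              double *c, double *c_slices) {
        const double one = 1.0, zero = 0.0;

        /* Outside a parallel region: LibSci runs the threaded dgemm
           with OMP_NUM_THREADS threads. */
        dgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);

        /* Inside a parallel region: each call gets the single-thread dgemm. */
        #pragma omp parallel for
        for (int i = 0; i < 4; i++)
            dgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n,
                   &zero, &c_slices[(size_t)i * n * n], &n);
    }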
68.
[chart: Libsci DGEMM efficiency - GFLOPs vs. square matrix dimension for 1, 3, 6, 9, and 12 threads]
69.
[chart: Libsci-10.5.2 performance on 2 x MC12 2.0 GHz (Cray XE6) - GFLOPS vs. number of threads (1-24) for K = 64 through K = 800]
70.
All BLAS libraries are optimized for the rank-k update. However, a huge % of dgemm usage is not from solvers but from explicit calls; e.g. DCA++ matrices are of this form. How can we very easily provide an optimization for these types of matrices?
71.
Cray BLAS existed on every Cray machine between Cray-2 and Cray X2. The Cray XT line did not include Cray BLAS: Cray's expertise was in vector processors, and GotoBLAS was the best performing x86 BLAS. LibGoto is now discontinued. In Q3 2011, LibSci will be released with Cray BLAS.
72.
1. Customers require more OpenMP features unobtainable with the current library
2. Customers require more adaptive performance for unusual problems, e.g. DCA++
3. Interlagos / Bulldozer is a dramatic shift in ISA/architecture/performance
4. Our auto-tuning framework has advanced to the point that we can tackle this problem (good BLAS is easy, excellent BLAS is very hard)
5. Need for bit-reproducible BLAS at high performance
73.
"anything that can
be represented in C, Fortran or ASM code can be generated automatically by one instance of an abstract operator in high-level code“ In other words, if we can create a purely general model of matrix-multiplication, and create every instance of it, then at least one of the generated schemes will perform well 2011 HPCMP User Group © Cray Inc. June 20, 2011 79
74.
- Start with a completely general formulation of the BLAS
- Use a DSL that expresses every important optimization
- Auto-generate every combination of orderings, buffering, and optimization
- For every combination of the above, sweep all possible sizes
- For a given input set (M, N, K, datatype, alpha, beta), map the best dgemm routine to the input
- The current library should be a specific instance of the above; worst-case performance can be no worse than the current library
- The lowest level of blocking is a hand-written assembly kernel
75.
[chart: GFLOPS of the auto-tuned framework ("bframe") vs. libsci across problem sizes, roughly 7.1-7.5 GFLOPS]
76.
New optimizations for the Gemini network in the ScaLAPACK LU and Cholesky routines:
1. Change the default broadcast topology to match the Gemini network
2. Give tools to allow the topology to be changed by the user
3. Give guidance on how grid shape can affect the performance
77.
Parallel version of LAPACK GETRF:
- Panel factorization: only a single column block is involved; the rest of the PEs are waiting
- Trailing matrix update: the major part of the computation
- Column-wise broadcast (blocking); row-wise broadcast (asynchronous)
- Data is packed before sending using PBLAS; broadcast uses the BLACS library
- These broadcasts are the major communication patterns
78.
MPI default: binomial tree + node-aware broadcast
- All PEs make an implicit barrier to ensure completion
- Not suitable for rank-k update
Bidirectional-ring broadcast
- Root PE makes 2 MPI Send calls, one in each direction
- The immediate neighbor finishes first
- ScaLAPACK's default; better than MPI
79.
Increasing-ring broadcast (our new default; a sketch follows)
- Root makes a single MPI call to the immediate neighbor; pipelining
- Better than bidirectional ring; the immediate neighbor finishes first
Multi-ring broadcast (2, 4, 8, etc.)
- The immediate neighbor finishes first
- The root PE sends to multiple sub-rings; can be done with a tree algorithm
- 2 rings seems the best for the row-wise broadcast of LU
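A minimal sketch of the increasing-ring idea in plain MPI (my illustration, not the BLACS implementation): the root sends once to its immediate neighbor, and every other rank receives from its predecessor and forwards to its successor, so the message pipelines around the ring and the root's neighbor finishes first.

    #include <mpi.h>

    /* Increasing-ring broadcast over comm, rooted at `root`: each
       non-root rank receives from its predecessor on the ring, then
       forwards to its successor (unless the successor is the root). */
    void ring_bcast(void *buf, int count, MPI_Datatype type,
                    int root, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int next = (rank + 1) % size;
        int prev = (rank - 1 + size) % size;

        if (rank != root)
            MPI_Recv(buf, count, type, prev, 0, comm, MPI_STATUS_IGNORE);
        if (next != root)
            MPI_Send(buf, count, type, next, 0, comm);
    }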
80.
Hypercube
- Behaves like the MPI default; too many collisions in the message traffic
Decreasing ring
- The immediate neighbor finishes last; no benefit in LU
Modified increasing ring
- Best performance in HPL; as good as increasing ring
81.
XDLU performance: 3072 cores, size=65536
[chart: Gflops vs. NB / P / Q combinations, SRING vs. IRING broadcast]
82.
XDLU performance: 6144 cores, size=65536
[chart: Gflops vs. NB / P / Q combinations, SRING vs. IRING broadcast]
83.
A row-major process grid puts adjacent PEs in the same row, and adjacent PEs are most probably located in the same node:
- In flat MPI, 16 or 24 PEs are in the same node; in hybrid mode, several are
- Most MPI sends in the increasing ring happen in the same node
- MPI has a good shared-memory device; good pipelining
[figure: ranks laid out row-major across Node 0, Node 1, Node 2]
84.
These variables let users choose the broadcast algorithm:
- For PxGETRF: SCALAPACK_LU_CBCAST, SCALAPACK_LU_RBCAST
- For PxPOTRF: SCALAPACK_LLT_CBCAST, SCALAPACK_LLT_RBCAST, SCALAPACK_UTU_CBCAST, SCALAPACK_UTU_RBCAST
Values:
  IRING  increasing ring (default value)
  DRING  decreasing ring
  SRING  split ring (old default value)
  MRING  multi-ring
  HYPR   hypercube
  MPI    mpi_bcast
  TREE   tree
  FULL   fully connected
There is also a set function, allowing the user to change these on the fly.
85.
Grid shape / size: a square grid is most common, but square grids are not often the best. Try to use Q = x * P grids, where x = 2, 4, 6, 8.
Blocksize: unlike HPL, fine-tuning is not important; 64 is usually the best.
Ordering: try using column-major ordering, it can be better.
BCAST: the new default will be a huge improvement if you can make your grid the right way. If you cannot, play with the environment variables.
86.
[figure]
87.
Full MPI2 support (except process spawning), based on ANL MPICH2:
- Cray used the MPICH2 Nemesis layer for Gemini
- Cray-tuned collectives
- Cray-tuned ROMIO for MPI-IO
Current release: 5.3.0 (MPICH 1.3.1)
- Improved MPI_Allreduce and MPI_Alltoallv
- Initial support for checkpoint/restart for MPI or Cray SHMEM on XE systems
- Improved support for MPI thread safety
- module load xt-mpich2
Tuned SHMEM library: module load xt-shmem
88.
MPI_Alltoall with 10,000 processes: comparing original vs. optimized algorithms on Cray XE6 systems
[chart: microseconds vs. message size (256-32768 bytes), original vs. optimized algorithm]
89.
8-byte MPI_Allgather and MPI_Allgatherv scaling: comparing original vs. optimized algorithms on Cray XE6 systems; the MPI_Allgather and MPI_Allgatherv algorithms were optimized for the Cray XE6
[chart: microseconds vs. number of processes (1024p-32768p), original vs. optimized Allgather and Allgatherv]
90.
Default is 8192 bytes: the maximum size message that can go through the eager protocol.
- May help for apps that are sending medium-size messages, and do better when loosely coupled
- Does the application have a large amount of time in MPI_Waitall? Setting this environment variable higher may help. Max value is 131072 bytes.
- Remember, for this path it helps to pre-post receives if possible
- Note that a 40-byte CH3 header is included when accounting for the message size
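The "pre-post receives" advice in code form (a generic MPI sketch, not from the slides): posting receives before the matching sends arrive lets eager messages land directly in user buffers instead of in unexpected-message buffers.

    #include <mpi.h>

    /* Exchange one medium-size message with each neighbor.  Receives
       are posted before any send, so incoming eager data has a
       destination buffer waiting. */
    void exchange(double *recvbuf, double *sendbuf, int count,
                  const int *neighbors, int nneigh, MPI_Comm comm) {
        MPI_Request reqs[2 * 8];   /* sketch assumes nneigh <= 8 */

        for (int i = 0; i < nneigh; i++)
            MPI_Irecv(&recvbuf[(long)i * count], count, MPI_DOUBLE,
                      neighbors[i], 0, comm, &reqs[i]);

        for (int i = 0; i < nneigh; i++)
            MPI_Isend(&sendbuf[(long)i * count], count, MPI_DOUBLE,
                      neighbors[i], 0, comm, &reqs[nneigh + i]);

        MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);
    }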
91.
Default is 64 32K buffers (2M total): controls the number of 32K DMA buffers available for each rank to use in the eager protocol described earlier. May help to modestly increase, but other resources constrain the usability of a large number of buffers.
92.
[figure]
93.
What do I mean by PGAS? Partitioned Global Address Space:
- UPC
- CoArray Fortran (Fortran 2008)
- SHMEM (I will count it as PGAS for convenience)
SHMEM is library based:
- Not part of any language standard; compiler independent
- The compiler has no knowledge that it is compiling a PGAS code and does nothing different, i.e. no transformations or optimizations
94.
UPC
- Specification that extends the ISO/IEC 9899 standard for C
- Integrated into the language; heavily compiler dependent
- Compiler is intimately involved in detecting and executing remote references
- Flexible, but filled with challenges like pointers, a lack of true multidimensional arrays, and many options for distributing data
Fortran 2008
- Now incorporates coarrays; compiler dependent
- Philosophically different from UPC: replication of arrays on every image, with "easy and obvious" ways to access those remote locations
95.
[figure]
96.
Translate the UPC source code into hardware-executable operations that produce the proper behavior, as defined by the specification:
- Storing to a remote location? Loading from a remote location?
- When does the transfer need to be complete?
- Are there any dependencies between this transfer and anything else?
No ordering guarantees are provided by the network; the compiler is responsible for making sure everything gets to its destination in the correct order.
97.
for ( i = 0; i < ELEMS_PER_THREAD; i+=1 ) {
    local_data[i] += global_2d[i][target];
}

becomes, conceptually:

for ( i = 0; i < ELEMS_PER_THREAD; i+=1 ) {
    temp = pgas_get(&global_2d[i]);  // Initiate the get
    pgas_fence();                    // Makes sure the get is complete
    local_data[i] += temp;           // Use the local location to complete the operation
}

The compiler must:
- Recognize you are referencing a shared location
- Initiate the load of the remote data
- Make sure the transfer has completed
- Proceed with the calculation
- Repeat for all iterations of the loop
98.
for ( i = 0; i < ELEMS_PER_THREAD; i+=1 ) {
    temp = pgas_get(&global_2d[i]);  // Initiate the get
    pgas_fence();                    // Makes sure the get is complete
    local_data[i] += temp;           // Use the local location to complete the operation
}

Simple translation results in:
- Single-word references
- Lots of fences
- Little to no latency hiding
- No use of special hardware
Nothing here says "fast".
99.
Want the compiler to generate code that will run as fast as possible given what the user has written, or allow the user to get fast performance with simple modifications:
- Increase message size: do multi/many-word transfers whenever possible, not single word
- Minimize fences: delay the fence "as much as possible", and eliminate the fence in some circumstances
- Use the appropriate hardware: on-node hardware for on-node transfers, and the transfer mechanism appropriate for this message size
- Overlap communication and computation
- Use hardware atomic functions where appropriate
100.
Primary Loop Type: a - atomic memory operation; b - blocked; c - conditional and/or computed; f - fused; g - partitioned; i - interchanged; m - partitioned; n - non-blocking remote transfer; p - partial; r - unrolled; s - shortloop; V - Vectorized; w - unwound
Modifiers: A - Pattern matched; C - Collapsed; D - Deleted; E - Cloned; G - Accelerated; I - Inlined; M - Multithreaded
101.
[figure]
102.
15.   shared long global_1d[MAX_ELEMS_PER_THREAD * THREADS];
...
83.  1          before = upc_ticks_now();
84.  1  r8----< for ( i = 0, j = target; i < ELEMS_PER_THREAD ;
85.  1  r8          i += 1, j += THREADS ) {
86.  1  r8 n      local_data[i] = global_1d[j];
87.  1  r8----> }
88.  1          after = upc_ticks_now();

1D get BW = 0.027598 Gbytes/s
103.
15.   shared long global_1d[MAX_ELEMS_PER_THREAD * THREADS];
...
101.  1  before = upc_ticks_now();
102.  1  upc_memget(&local_data[0], &global_1d[target], 8*ELEMS_PER_THREAD);
103.  1
104.  1  after = upc_ticks_now();

1D get BW = 0.027598 Gbytes/s
1D upc_memget BW = 4.972960 Gbytes/s
upc_memget is 184 times faster!!
104.
16.   shared long global_2d[MAX_ELEMS_PER_THREAD][THREADS];
...
121.  1  A-------< for ( i = 0; i < ELEMS_PER_THREAD; i+=1) {
122.  1  A            local_data[i] = global_2d[i][target];
123.  1  A-------> }

1D get BW = 0.027598 Gbytes/s
1D upc_memget BW = 4.972960 Gbytes/s
2D get BW = 4.905653 Gbytes/s
Pattern matching can give you the same performance as if using upc_memget.
105.
106.
PGAS data references made by the single statement immediately following the pgas defer_sync directive will not be synchronized until the next fence instruction.
Applies only to the next UPC/CAF statement; it does not apply to upc_* routines or to SHMEM routines.
Normally the compiler synchronizes the references in a statement as late as possible without violating program semantics. The purpose of defer_sync is to synchronize the references even later, beyond where the compiler can determine it is safe.
Extremely powerful: communication and computation can easily be overlapped with this directive.
Applies to both "gets" and "puts", and can be used to implement a variety of "tricks"; use your imagination! A sketch follows below.
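A minimal sketch of the overlap pattern, assuming the directive is spelled #pragma pgas defer_sync and that the standard upc_fence statement serves as the completing fence (do_other_work and consume are placeholder routines, not from the slides):

    #include <upc.h>

    shared double remote_val[THREADS];
    double local_copy;

    extern void do_other_work(void);   /* independent computation */
    extern void consume(double x);     /* consumer of the fetched value */

    void overlap_example(int neighbor)
    {
    #pragma pgas defer_sync                  /* applies only to the next statement */
        local_copy = remote_val[neighbor];   /* get is initiated, NOT synchronized here */

        do_other_work();                     /* overlaps the in-flight transfer */

        upc_fence;                           /* the deferred get completes at this fence */
        consume(local_copy);                 /* now safe to use the fetched value */
    }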
107.
CrayPat
108.
Future system basic characteristics:
Many-core, hybrid multi-core computing
Increase in on-node concurrency: 10s-100s of cores sharing memory
With or without a companion accelerator
Vector hardware at the low level
Impact on applications:
Restructure / evolve applications while using existing programming models to take advantage of increased concurrency
Expand use of mixed-mode programming models (MPI + OpenMP + accelerated kernels, etc.)
109.
Focus on automation (simplify tool usage, provide feedback based on analysis)
Enhance support for multiple programming models within a program (MPI, PGAS, OpenMP, SHMEM)
Scaling (larger jobs, more data, better tool response)
New processors and interconnects
Extend performance tools to include pre-runtime optimization information from the Cray compiler
110.
New predefined wrappers (ADIOS, ARMCI, PETSc, PGAS libraries)
More UPC and Co-array Fortran support
Support for non-record-locking file systems
Support for applications built with shared libraries
Support for Chapel programs
pat_report tables available in Cray Apprentice2
111.
Enhanced PGAS support is available in perftools 5.1.3 and later.
Profiles of a PGAS program can be created to show:
Top time-consuming functions / line numbers in the code
Load imbalance information
Performance statistics are attributed to user source by default; statistics can also be exposed by library, to see underlying operations such as wait time on barriers.
Data collection is based on the methods used for the MPI library:
PGAS data is collected by default when using Automatic Profiling Analysis (pat_build -O apa)
Predefined wrappers for runtime libraries (caf, upc, pgas) enable attribution of samples or time to user source
UPC and SHMEM heap tracking is coming in a subsequent release: -g heap will track the shared heap in addition to the local heap.
112.
Table 1:  Profile by Function

 Samp % | Samp | Imb. | Imb.   | Group
        |      | Samp | Samp % |  Function
        |      |      |        |   PE='HIDE'

 100.0% |   48 |   -- |     -- | Total
|------------------------------------------
|  95.8% |   46 |   -- |     -- | USER
||-----------------------------------------
||  83.3% |  40 | 1.00 |   3.3% | all2all
||   6.2% |   3 | 0.50 |  22.2% | do_cksum
||   2.1% |   1 | 1.00 |  66.7% | do_all2all
||   2.1% |   1 | 0.50 |  66.7% | mpp_accum_long
||   2.1% |   1 | 0.50 |  66.7% | mpp_alloc
||=========================================
|   4.2% |    2 |   -- |     -- | ETC
||-----------------------------------------
||   4.2% |   2 | 0.50 |  33.3% | bzero
|==========================================
113.
Table 2:  Profile by Group, Function, and Line

 Samp % | Samp | Imb. | Imb.   | Group
        |      | Samp | Samp % |  Function
        |      |      |        |   Source
        |      |      |        |    Line
        |      |      |        |     PE='HIDE'

 100.0% |   48 |   -- |     -- | Total
|--------------------------------------------
|  95.8% |   46 |   -- |     -- | USER
||-------------------------------------------
||  83.3% |  40 |   -- |     -- | all2all
3|        |      |      |        |  mpp_bench.c
4|        |      |      |        |   line.298
||   6.2% |   3 |   -- |     -- | do_cksum
3|        |      |      |        |  mpp_bench.c
||||-----------------------------------------
4|||  2.1% |  1 | 0.25 |  33.3% |    line.315
4|||  4.2% |  2 | 0.25 |  16.7% |    line.316
||||=========================================
114.
Table 1:  Profile by Function and Callers, with Line Numbers

 Samp % | Samp | Group
        |      |  Function
        |      |   Caller
        |      |    PE='HIDE'

 100.0% |   47 | Total
|---------------------------
|  93.6% |   44 | ETC
||--------------------------
||  85.1% |  40 | upc_memput
3|        |      |  all2all:mpp_bench.c:line.298
4|        |      |   do_all2all:mpp_bench.c:line.348
5|        |      |    main:test_all2all.c:line.70
||   4.3% |   2 | bzero
3|        |      |  (N/A):(N/A):line.0
||   2.1% |   1 | upc_all_alloc
3|        |      |  mpp_alloc:mpp_bench.c:line.143
4|        |      |   main:test_all2all.c:line.25
||   2.1% |   1 | upc_all_reduceUL
3|        |      |  mpp_accum_long:mpp_bench.c:line.185
4|        |      |   do_cksum:mpp_bench.c:line.317
5|        |      |    do_all2all:mpp_bench.c:line.341
6|        |      |     main:test_all2all.c:line.70
||==========================
115.
Table 1:  Profile by Function and Callers, with Line Numbers

 Time % |     Time |   Calls | Group
        |          |         |  Function
        |          |         |   Caller
        |          |         |    PE='HIDE'

 100.0% | 0.795844 | 73904.0 | Total
|-----------------------------------------
|  78.9% | 0.628058 | 41121.8 | PGAS
||----------------------------------------
||  76.1% | 0.605945 | 32768.0 | __pgas_put
3|        |          |         |  all2all:mpp_bench.c:line.298
4|        |          |         |   do_all2all:mpp_bench.c:line.348
5|        |          |         |    main:test_all2all.c:line.70
||   1.5% | 0.012113 |    10.0 | __pgas_barrier
3|        |          |         |  (N/A):(N/A):line.0
…
116.
…
||========================================
|  15.7% | 0.125006 |     3.0 | USER
||----------------------------------------
||  12.2% | 0.097125 |     1.0 | do_all2all
3|        |          |         |  main:test_all2all.c:line.70
||   3.5% | 0.027668 |     1.0 | main
3|        |          |         |  (N/A):(N/A):line.0
||========================================
|   5.4% | 0.042777 | 32777.2 | UPC
||----------------------------------------
||   5.3% | 0.042321 | 32768.0 | upc_memput
3|        |          |         |  all2all:mpp_bench.c:line.298
4|        |          |         |   do_all2all:mpp_bench.c:line.348
5|        |          |         |    main:test_all2all.c:line.70
|=========================================
117.
New text-table icon; right-click for table generation options.
118.
119.
Scalability:
New .ap2 data format and client/server model
Reduced pat_report processing and report-generation times
Reduced app2 data-load times
Graphical presentation handled locally (not passed through the ssh connection), for better tool responsiveness
Minimizes the data loaded into memory at any given time; reduced server footprint on the Cray XT/XE service node
Larger jobs supported
Distributed Cray Apprentice2 (app2) client for Linux; app2 clients for Mac and Windows laptops coming later this year
120.
CPMD: MPI, instrumented with pat_build -u, HWPC=1, 960 cores

                 Perftools 5.1.3   Perftools 5.2.0
.xf -> .ap2         88.5 seconds      22.9 seconds
.ap2 -> report    1512.27 seconds     49.6 seconds

VASP: MPI, instrumented with pat_build -gmpi -u, HWPC=3, 768 cores

                 Perftools 5.1.3   Perftools 5.2.0
.xf -> .ap2         45.2 seconds      15.9 seconds
.ap2 -> report     796.9 seconds      28.0 seconds
121.
From the Linux desktop:
% module load perftools
% app2
% app2 kaibab:
% app2 kaibab:/lus/scratch/heidi/swim+pat+10302-0t.ap2
The ':' signifies a remote host rather than a local .ap2 file; alternatively, use File->Open Remote… in the GUI.
122.
Optional app2 client for the Linux desktop, available as of 5.2.0 (you can still run app2 from a Cray service node):
Improves response times, as X11 traffic is no longer passed through the ssh connection
Replaces the 32-bit Linux desktop version of Cray Apprentice2
Uses libssh to establish the connection
app2 clients for Windows and Mac coming in a subsequent release
123.
(diagram: Linux desktop displaying an X Window System application forwarded from a Cray XT login node; my_program.ap2 holds performance data collected on the compute nodes)
Log into the Cray XT/XE login node:
% ssh -Y seal
Launch Cray Apprentice2 on the login node:
% app2 /lus/scratch/mydir/my_program.ap2
The user interface is displayed on the desktop via ssh trusted X11 forwarding.
The entire my_program.ap2 file is loaded into memory on the XT login node (can be Gbytes of data).
124.
(diagram: app2 client on the Linux desktop talking to an app2 server on the Cray XT login node; only user-requested data crosses the connection)
Launch Cray Apprentice2 on the desktop, pointing it at the data:
% app2 seal:/lus/scratch/mydir/my_program.ap2
The user interface is displayed on the desktop via X Windows-based software.
Only a minimal subset of the data from my_program.ap2 is loaded into memory on the Cray XT/XE service node at any given time, and only the requested data is sent from server to client.
125.
126.
Major change to the way HW counters are collected, starting with CrayPat 5.2.1 and CLE 4.0 (in conjunction with Interlagos support):
Linux has officially incorporated support for accessing counters through the perf_events subsystem. Until now, Linux kernels had to be patched to add support for perfmon2, which provided access to the counters for PAPI and for CrayPat.
Seamless to users, except that the overhead incurred when accessing counters has increased, creating additional application perturbation. Cray is working to bring this back in line with perfmon2 overhead.
127.
When possible, CrayPat will identify dominant communication grids (communication patterns) in a program, for example a nearest-neighbor exchange in 2 or 3 dimensions (Sweep3D uses a 2-D grid for communication).
It then determines whether or not a custom MPI rank order would produce a significant performance benefit.
Custom rank orders are helpful for programs with significant point-to-point communication, and they do not interfere with MPI collective-communication optimizations.
128.
Focuses on intra-node communication (places ranks that communicate frequently on the same node, or close by), with an option to focus on other metrics such as memory bandwidth.
The analysis:
Determines the rank order used during the run that produced the data
Determines the grid that defines the communication
Produces a custom rank order if it is beneficial, based on grid size, grid order, and cost metric
Summarizes the findings in the report and describes how to re-run with the custom rank order
129.
For Sweep3D with 768 MPI ranks, the report reads:

  This application uses point-to-point MPI communication between nearest
  neighbors in a 32 X 24 grid pattern. Time spent in this communication
  accounted for over 50% of the execution time. A significant fraction (but
  not more than 60%) of this time could potentially be saved by using the
  rank order in the file MPICH_RANK_ORDER.g which was generated along with
  this report.

  To re-run with a custom rank order …
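The slide elides the re-run instructions; a plausible recipe, assuming Cray MPT's standard rank-reorder mechanism (verify against the intro_mpi man page), looks like:

% cp MPICH_RANK_ORDER.g MPICH_RANK_ORDER   # the file is read from the current directory
% export MPICH_RANK_REORDER_METHOD=3       # 3 = custom ordering from MPICH_RANK_ORDER
% aprun -n 768 ./sweep3d                   # re-run the application as before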
130.
Assist the user with application performance analysis and optimization:
Help the user identify important and meaningful information from potentially massive data sets
Help the user identify problem areas instead of just reporting data
Bring optimization knowledge to a wider set of users
Focus on ease of use and intuitive user interfaces:
Automatic program instrumentation
Automatic analysis
Target scalability issues in all areas of tool development:
Data management: storage, movement, presentation
131.
Supports traditional post-mortem performance analysis:
Automatic identification of performance problems
Indication of causes of problems
Suggestions of modifications for performance improvement
CrayPat:
pat_build: automatic instrumentation (no source code changes needed)
Run-time library for measurements (transparent to the user)
pat_report for performance analysis reports
pat_help: online help utility
Cray Apprentice2:
Graphical performance analysis and visualization tool
132.
CrayPat:
Instrumentation of optimized code
No source code modification required
Data collection transparent to the user
Text-based performance reports
Derived metrics
Performance analysis
Cray Apprentice2:
Performance data visualization tool
Call tree view
Source code mappings
133.
When performance measurement is triggered:
External agent (asynchronous): sampling, driven by a timer interrupt or a hardware-counter overflow
Internal agent (synchronous): code instrumentation, event based, with automatic or manual instrumentation
How performance data is recorded:
Profile ::= summation of events over time; run-time summarization (functions, call sites, loops, …)
Trace file ::= sequence of events over time
134.
Millions of lines of code: automatic profiling analysis
  Identifies top time-consuming routines
  Automatically creates an instrumentation template customized to your application
Lots of processes/threads: load imbalance analysis
  Identifies computational code regions and synchronization calls that could benefit most from load-balance optimization
  Estimates the savings if the corresponding section of code were balanced
Long-running applications: detection of outliers
135.
Important performance statistics:
Top time-consuming routines
Load balance across computing resources
Communication overhead
Cache utilization
FLOPS
Vectorization (SSE instructions)
Ratio of computation versus communication
136.
No source code or makefile modification required:
Automatic instrumentation at group (function) level; groups: mpi, io, heap, math SW, …
Performs link-time instrumentation:
Requires object files
Instruments optimized code
Generates a stand-alone instrumented program
Preserves the original binary
Supports sample-based and event-based instrumentation
137.
Analyzes the performance data and directs the user to meaningful information; simplifies the procedure to instrument and collect performance data for novice users.
Based on a two-phase mechanism:
1. Automatically detects the most time-consuming functions in the application and feeds this information back to the tool for further (and focused) data collection
2. Provides performance information on the most significant parts of the application
138.
Performs data conversion: combines information from the binary with the raw performance data
Performs analysis on the data
Generates a text report of performance results
Formats data for input into Cray Apprentice2
139.
CrayPat / Cray Apprentice2 5.0, released September 10, 2009:
New internal data format
FAQ
Grid placement support
Better caller information (ETC group in pat_report)
Support for larger numbers of processors
Client/server version of Cray Apprentice2
Panel help in Cray Apprentice2
140.
Access the performance tools software:
% module load perftools
Build the application, keeping the .o files (CCE: -h keepfiles):
% make clean
% make
Instrument the application for automatic profiling analysis; you should get an instrumented program a.out+pat:
% pat_build -O apa a.out
Run the application to find the top time-consuming routines; you should get a performance file ("<sdatafile>.xf") or multiple files in a directory <sdatadir>:
% aprun … a.out+pat    (or qsub <pat script>)
141.
Generate a report and the .apa instrumentation file:
% pat_report -o my_sampling_report [<sdatafile>.xf | <sdatadir>]
Inspect the .apa file and the sampling report; verify whether additional instrumentation is needed.
142.
#  You can edit this file, if desired, and use it
#  to reinstrument the program for tracing like this:
#
#      pat_build -O mhd3d.Oapa.x+4125-401sdt.apa
#
#  These suggested trace options are based on data from:
#
#      /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.ap2,
#      /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.xf
# ----------------------------------------------------------------------
#  HWPC group to collect by default.

  -Drtenv=PAT_RT_HWPC=1   # Summary with instructions metrics.
# ----------------------------------------------------------------------
#  Libraries to trace.

  -g mpi
# ----------------------------------------------------------------------
#  User-defined functions to trace, sorted by % of samples.
#  Limited to top 200. A function is commented out if it has < 1%
#  of samples, or if a cumulative threshold of 90% has been reached,
#  or if it has size < 200 bytes.
#  Note: -u should NOT be specified as an additional option.

# 43.37% 99659 bytes
  -T mlwxyz_

# 16.09% 17615 bytes
  -T half_

# 6.82% 6846 bytes
  -T artv_

# 1.29% 5352 bytes
  -T currenh_

# 1.03% 25294 bytes
  -T bndbo_

#  Functions below this point account for less than 10% of samples.

# 1.03% 31240 bytes
# -T bndto_
...
# ----------------------------------------------------------------------

  -o mhd3d.x+apa   # New instrumented program.

  /work/crayadm/ldr/mhd3d/mhd3d.x   # Original program.
143.
biolib     Cray Bioinformatics library routines
blacs      Basic Linear Algebra communication subprograms
blas       Basic Linear Algebra subprograms
caf        Co-Array Fortran (Cray X2 systems only)
fftw       Fast Fourier Transform library (64-bit only)
hdf5       manages extremely large and complex data collections
heap       dynamic heap
io         includes the stdio and sysio groups
lapack     Linear Algebra Package
lustre     Lustre File System
math       ANSI math
mpi        MPI
netcdf     network common data form (manages array-oriented scientific data)
omp        OpenMP API (not supported on Catamount)
omp-rtl    OpenMP runtime library (not supported on Catamount)
portals    lightweight message passing API
pthreads   POSIX threads (not supported on Catamount)
scalapack  Scalable LAPACK
shmem      SHMEM
stdio      all library functions that accept or return the FILE* construct
sysio      I/O system calls
system     system calls
upc        Unified Parallel C (Cray X2 systems only)
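For example, to trace MPI plus I/O calls in an already-built binary (a sketch; the binary name is illustrative, and if your pat_build version does not accept a comma-separated list, repeat the -g option instead):

% pat_build -g mpi,io my_app    # produces my_app+pat with MPI and I/O wrappers
% aprun -n 64 ./my_app+pat      # run as usual; writes .xf raw data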
144.
0   Summary with instruction metrics
1   Summary with TLB metrics
2   L1 and L2 metrics
3   Bandwidth information
4   Hypertransport information
5   Floating point mix
6   Cycles stalled, resources idle
7   Cycles stalled, resources full
8   Instructions and branches
9   Instruction cache
10  Cache hierarchy
11  Floating point operations mix (2)
12  Floating point operations mix (vectorization)
13  Floating point operations mix (SP)
14  Floating point operations mix (DP)
15  L3 (socket-level)
16  L3 (core-level reads)
17  L3 (core-level misses)
18  L3 (core-level fills caused by L2 evictions)
19  Prefetches
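A group is selected at run time via the PAT_RT_HWPC environment variable (the same variable shown earlier in the .apa file); for example, to collect L1/L2 cache metrics, group 2 (shell syntax is illustrative):

% export PAT_RT_HWPC=2      # group 2: L1 and L2 metrics
% aprun -n 64 ./my_app+pat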
145.
Regions, useful to break up long routines:
int PAT_region_begin (int id, const char *label)
int PAT_region_end (int id)
Disable/enable profiling, useful for excluding initialization:
int PAT_record (int state)
Flush the buffer, useful when the program isn't exiting cleanly:
int PAT_flush_buffer (void)
A sketch of the API follows below.
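A minimal sketch of the API around a solver phase, assuming the CrayPat API header is pat_api.h and that it defines PAT_STATE_ON/PAT_STATE_OFF constants for PAT_record (worth verifying against the pat man pages; signatures follow the slide):

    #include <pat_api.h>

    static void initialize(void) { /* setup we don't want profiled */ }
    static void solve(void)      { /* the computation of interest  */ }

    int main(void)
    {
        PAT_record(PAT_STATE_OFF);       /* exclude initialization from the data */
        initialize();
        PAT_record(PAT_STATE_ON);        /* resume recording */

        PAT_region_begin(1, "solver");   /* id and label appear in pat_report tables */
        solve();
        PAT_region_end(1);               /* id must match the corresponding begin */

        PAT_flush_buffer();              /* flush now, per the signature above */
        return 0;
    }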
146.
Instrument the application for further analysis (a.out+apa):
% pat_build -O <apafile>.apa
Run the application:
% aprun … a.out+apa    (or qsub <apa script>)
Generate the text report and the visualization file (.ap2):
% pat_report -o my_text_report.txt [<datafile>.xf | <datadir>]
View the report as text and/or with Cray Apprentice2:
% app2 <datafile>.ap2
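Putting the last few slides together, the whole APA cycle for a program a.out looks roughly like this (the process count is illustrative; the <sdatafile>/<apafile>/<datafile> names are generated on a real run):

% module load perftools
% make clean && make                                 # build, keeping .o files (CCE: -h keepfiles)
% pat_build -O apa a.out                             # -> a.out+pat
% aprun -n 64 ./a.out+pat                            # sampling run; writes <sdatafile>.xf
% pat_report -o my_sampling_report <sdatafile>.xf    # also emits the .apa file
% pat_build -O <apafile>.apa                         # -> a.out+apa for the tracing run
% aprun -n 64 ./a.out+apa
% pat_report -o my_text_report.txt <datafile>.xf     # also emits <datafile>.ap2
% app2 <datafile>.ap2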
147.
Raw performance data MUST be written to Lustre (/work/…, /lus/…, /scratch/…, etc.).
Number of files used to store raw data:
1 file created for a program with 1 - 256 processes
√n files created for a program with 257 - n processes
The count can be customized with PAT_RT_EXPFILE_MAX; an example follows below.
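For instance (the value here is illustrative; see the intro_craypat man page for the exact semantics of PAT_RT_EXPFILE_MAX):

% export PAT_RT_EXPFILE_MAX=512    # allow up to 512 raw-data (.xf) files
% aprun -n 4096 ./a.out+pat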
148.
Full trace files show transient events but are too large; current run-time summarization misses transient events.
Plan to add the ability to record:
Top N peak values (N small)
Approximate std dev over time
For time, memory traffic, etc., during both tracing and sampling
149.
Cray Apprentice2 features:
Call graph profile
Communication statistics
Time-line view (communication and I/O)
Activity view
Pair-wise communication statistics
Text reports
Source code mapping
Cray Apprentice2 is targeted to help identify and correct:
Load imbalance
Excessive communication
Network contention
Excessive serialization
I/O problems
150.
(screenshot: Switch Overview display)
151.
152.
153.
154.
(screenshot: Min, Avg, and Max values, with -1 / +1 std dev marks)