Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking for Extreme-Scale Many-Core Systems
ESPM2 2018
Nov.12th, Dallas
Hideyuki Tanaka*, Youhei Ishihara, Ryo Sakamoto,
Takashi Nakamura, Yasuyuki Kimura, Keigo Nitadori,
Miyuki Tsubouchi, Jun Makino
Abstract
- For an explicit finite-difference scheme applied to computational fluid dynamics, we achieved 4.78 PFlops (21.5% of peak performance) by using temporal blocking on a large-scale PEZY-SC2-based system, which has a very low B/F (bytes per flop) ratio
- The achieved efficiency is comparable to recent results on systems with very high B/F
- To achieve this efficiency on a low-B/F machine, we developed
  - A framework for explicit stencil computation that generates the MPI boilerplate code and the device kernel code with temporal blocking
  - A finite-difference scheme suited to temporal blocking
Table of Contents
- Introduction
  - Explicit stencil computation
  - Temporal blocking
  - About PEZY-SC2
- Details of our work
  - Code generation framework: Formura
  - Optimization for PEZY-SC2
- Benchmark results
  - Performance on large-scale systems (Gyoukou)
- Discussion and summary
Introduction
Explicit Stencil Computation
- Explicit stencil computation is a simple but very important class of HPC applications
- It is used to simulate weather, earthquakes, the solar interior, etc.
- Optimizing stencil computation is therefore very important
[Figure. Source: Riken]
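As a concrete picture of what "explicit stencil computation" means, here is a minimal sketch in Python. It is not taken from the talk: the 1D diffusion equation, the coefficient `c`, the periodic boundary, and the function names `step` and `run` are all assumptions chosen for illustration. Each new value depends only on a fixed neighborhood (the stencil) of values from the previous time step.

```python
def step(u, c=0.1):
    """Advance one explicit time step of 1D diffusion (periodic boundary).

    Each output cell reads only itself and its two neighbors from the
    previous time step: a 3-point stencil.
    """
    n = len(u)
    return [
        u[i] + c * (u[(i - 1) % n] - 2.0 * u[i] + u[(i + 1) % n])
        for i in range(n)
    ]


def run(u, steps, c=0.1):
    """Apply `steps` explicit time steps, one full sweep over the array each."""
    for _ in range(steps):
        u = step(u, c)
    return u


# Hypothetical initial condition: a single spike that diffuses outward.
u0 = [0.0] * 16
u0[8] = 1.0
u = run(u0, 10)
```

Every call to `step` sweeps the whole array once, which is why the memory traffic per time step matters so much on machines with low B/F.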
Efficiency of Recent Stencil Computation
- The efficiency of explicit methods on recent HPC hardware is not high enough
  - Even the best-case efficiency on the K computer is ≅ 20%; many other cases reach only ≅ 10%
- This low efficiency is caused either by the processor architecture or by memory bandwidth
- We address the memory-bandwidth problem, which does not depend on the architecture
  - Over the past decades, the B/F of HPC systems has fallen dramatically
  - This trend seems likely to continue
Relative Performance Trend
- Green: FLOPS vs. memory bandwidth (4.5x/decade)
- Red: FLOPS vs. network latency (~30x/decade)
- This trend seems likely to continue
[Figure. Source: John D. McCalpin]
PEZY-SC2
- Many-core MIMD processor
- 1984 individual RISC cores
- 2.8 TFlops peak double-precision performance (at 700 MHz)
- 4-channel DDR4 DRAM, 64 GB, 80 GB/s
- ⇒ B/F ≒ 0.03
  - cf. K computer: 0.5, Tesla V100: 0.12, TaihuLight: 0.04
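The B/F figure follows directly from the two numbers on this slide. A small sketch of the arithmetic, where the helper name `bytes_per_flop` is an assumption made for illustration:

```python
def bytes_per_flop(bandwidth_gb_per_s, peak_gflops):
    """Memory bandwidth divided by peak floating-point rate (B/F)."""
    return bandwidth_gb_per_s / peak_gflops


# Values quoted on the slide: 80 GB/s DRAM bandwidth, 2.8 TFlops peak DP.
pezy_sc2_bf = bytes_per_flop(80.0, 2800.0)
print(f"PEZY-SC2 B/F = {pezy_sc2_bf:.3f}")
```

The result is just under 0.03, which is what the slide rounds to.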
PEZY-SC2 Architecture
- The chip consists of 8 prefectures
- Each prefecture contains 16 cities
- Each city contains 16 processor elements (PEs)
- 8 × 16 × 16 − 64 (redundancy) = 1984 PEs
- The PEs in each city share an L2 cache
Gyoukou
- Supercomputer installed at JAMSTEC, Japan
  - (Available until April 2018)
- Peak 28.2 PFlops (full nodes)
- 4th on the Top500 list (Nov 2017)
- 10000 PEZY-SC2s + 1250 Xeon D processors (1 per 8 SC2s)
- World's largest number of MIMD processor cores (≒ 20M; cf. TaihuLight ≒ 11M)
- Well suited to testing whether code can scale to exascale systems
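The core counts quoted for the chip and for the full system can be cross-checked with a few lines of arithmetic. This is only a sketch using the figures from the slides; the constant names are assumptions.

```python
# Per-chip PE count from the hierarchy: prefectures x cities x PEs, minus redundancy.
PREFECTURES = 8
CITIES_PER_PREFECTURE = 16
PES_PER_CITY = 16
REDUNDANT_PES = 64
pes_per_chip = PREFECTURES * CITIES_PER_PREFECTURE * PES_PER_CITY - REDUNDANT_PES

# Full Gyoukou system: 10000 PEZY-SC2 chips at 2.8 TFlops each.
CHIPS = 10_000
total_cores = CHIPS * pes_per_chip
accelerator_peak_pflops = CHIPS * 2.8 / 1000.0

print(pes_per_chip, total_cores, accelerator_peak_pflops)
```

The accelerators alone come out slightly below the quoted 28.2 PFlops system peak; this sketch does not try to account for the difference.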
Details of the work
Temporal Blocking (TB)
- One solution for running explicit methods on low-B/F systems
- With TB, multiple time steps are computed on a working array while it stays in cache, which reduces the required B/F as long as the working array fits in the processor's cache
[Figure: data is read from DRAM into cache, advanced through several time steps (one computation per time step), and written back to DRAM; nodes exchange data over the network]
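The idea can be sketched in a few lines of Python. This is not the scheme used in the talk: it is the simplest overlapped (ghost-zone) variant of temporal blocking, applied to an assumed 1D diffusion stencil with periodic boundaries, and the names `naive` and `temporal_blocked` are hypothetical. Each block is loaded once together with a halo of `nt` cells per side, advanced `nt` time steps locally, and written back once.

```python
def naive(u, nt, c=0.1):
    """Reference: nt full sweeps over the array, one per time step."""
    n = len(u)
    for _ in range(nt):
        u = [u[i] + c * (u[(i - 1) % n] - 2.0 * u[i] + u[(i + 1) % n])
             for i in range(n)]
    return u


def temporal_blocked(u, nt, block, c=0.1):
    """Advance nt steps block by block, touching main memory once per block."""
    n = len(u)
    out = [0.0] * n
    for start in range(0, n, block):
        end = min(start + block, n)
        # One read: the block plus a halo of nt cells on each side
        # (the 3-point stencil needs one extra cell per side per step).
        work = [u[i % n] for i in range(start - nt, end + nt)]
        for _ in range(nt):
            # The valid region shrinks by one cell on each side per step.
            work = [work[j] + c * (work[j - 1] - 2.0 * work[j] + work[j + 1])
                    for j in range(1, len(work) - 1)]
        # One write: exactly the cells of this block remain.
        out[start:end] = work
    return out


u0 = [float((7 * i) % 5) for i in range(32)]
reference = naive(u0, 4)
blocked = temporal_blocked(u0, 4, block=8)
```

Per block, the data is read once and written once for `nt` time steps instead of once per step, at the price of recomputing the halo cells. That recomputation is the kind of redundant work a temporal-blocking scheme has to keep small.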
Various Methods of TB
[Figure: variants of temporal blocking. One variant is used for in-node computation; a variation of it is used for inter-node communication.]
Details of Our TB Calculation
- Inter-node communication
  - Each node sends data in one direction
  - Each node receives data from one direction
  - Communication-computation overlap is simple
Details of Our TB Calculation
- Computation starts from the right-most block
- The upper-right part of the parallelogram uses dummy data so that all loop lengths are equal
  - The gray part is results that are not needed
  - This adds only a small amount of extra computation
SL4TH3 Scheme
- Fourth-order accuracy
- Number of stencils = 2
- Flops per cell per step ≒ 2800
- Required B/F ≒ 0.05 without TB
[Figure: input differential equation]
Formura: a Framework for TB
- From a stencil description written in the Formura DSL, Formura generates optimized distributed-parallel code for large-scale parallel computers
- In this work, we added support for the TB method to Formura and developed a device kernel code generator for PEZY-SC2
[Figure: Formura DSL → (formura) → MPI driver code + TB kernel code → (gcc/mpicc) → Executable]
Code Generation by Formura
[Figure: an input equation (some additional configuration files are also needed), a small excerpt of the generated C code, and a zoomed-in view of the output]
Code Generation
- Formura generates:
  - Driver code that distributes TB over MPI
  - Optimized kernel code for node-local computation
- For a new accelerator (or any other processor), a backend can be added by modifying the code that computes the temporal-blocking steps
- Typically, the major per-device optimizations are the blocking layout (for data-access locality) and thread scheduling
Optimization for PEZY-SC2
- Decide the block size
  - The block must be smaller than the last-level cache (LLC)
  - Parallelism should be close to the number of PEs for load balancing
  - 44 × 44 × 44 is the best block size
    - 44³ × 10 (variables/cell) × 8 bytes = 6.50 MB
    - 6.50 MB × 2 (to overlap reads and writes) < 32 MB = LLC size
    - 44² (parallelism) = 1936 ≒ 1984 (number of PEs)
    - Total of 880³ (880 = 44 × 20) cells per node < 64 GB
- Allocate adjacent cells to PEs that share an L2 cache
- Reduce the instruction size of the innermost loop
  - PEZY-SC2's L1 I-cache size = 4 KB (= 1024 ops)
  - SL4TH3 requires 2800 ops/cell
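The block-size constraints above can be checked numerically. This is a sketch under one assumption not stated on the slide: the "MB" figures are read as binary megabytes (2^20 bytes), which is the interpretation that reproduces 6.50. The constant names are hypothetical.

```python
BLOCK = 44                 # block edge length in cells
VARS_PER_CELL = 10
BYTES_PER_VALUE = 8        # double precision
MIB = 1024 ** 2            # assumption: the slide's "MB" means 2**20 bytes

# Working-set size of one block, and of two blocks (read and write buffers).
block_bytes = BLOCK ** 3 * VARS_PER_CELL * BYTES_PER_VALUE
block_mib = block_bytes / MIB
double_buffered_mib = 2 * block_mib
LLC_MIB = 32

# Parallelism exposed by one block vs. the number of PEs on the chip.
parallelism = BLOCK ** 2
NUM_PES = 1984
pe_utilization = parallelism / NUM_PES

# Memory footprint of the full per-node domain, (44 * 20)**3 cells.
cells_per_node = (BLOCK * 20) ** 3
node_bytes = cells_per_node * VARS_PER_CELL * BYTES_PER_VALUE

print(f"block: {block_mib:.2f} MiB, double-buffered: {double_buffered_mib:.2f} MiB")
print(f"parallelism: {parallelism} of {NUM_PES} PEs ({pe_utilization:.1%})")
print(f"per-node domain: {node_bytes / 1e9:.1f} GB")
```

With a block edge of 44, the double-buffered working set uses well under half of the 32 MB LLC, and about 97.6% of the PEs have work.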
Results
Benchmark Results
- Conditions
  - SL4TH3 scheme
  - Optimized backend for PEZY-SC2
  - 8000 PEZY-SC2s (20 × 20 × 20 layout) on Gyoukou
    - ≒ 16M cores
  - 17600³ cells in total (17600 = 880 × 20)
- Performance results
  - 4.78 PFlops
  - 21.5% efficiency (22.2 PFlops theoretical peak)
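The headline numbers are mutually consistent, as a short check shows. This is a sketch that uses only figures quoted on the slides; the variable names are assumptions.

```python
achieved_pflops = 4.78
theoretical_peak_pflops = 22.2
efficiency = achieved_pflops / theoretical_peak_pflops

chips = 20 * 20 * 20             # 20 x 20 x 20 layout of PEZY-SC2s
cores = chips * 1984             # PEs per chip
total_cells = (880 * 20) ** 3    # 880 cells per chip edge, 20 chips per edge

print(f"efficiency: {efficiency:.1%}")
print(f"cores: {cores:,}")
print(f"cells: {total_cells:,}")
```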
Effect of Temporal Blocking
Required B/F by NT (size per node = 880³); NT = time step parameter

NT   Redundant calculation by TB   Required B/F
1    1.4%                          0.058
2    2.7%                          0.029
3    4.0%                          0.020
4    5.4%                          0.015
5    6.7%                          0.012
6    8.0%                          0.010
7    9.2%                          0.009
8    10.5%                         0.008

[Figure: calculation speed (GF) by NT]
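The required B/F values are close to what a very simple traffic model predicts. The sketch below assumes that NT is the number of time steps fused per temporal block, and that each cell's 10 double-precision variables are read once and written once per block of NT steps. Both are assumptions, and the model ignores halo traffic and the redundant computation listed in the table.

```python
FLOPS_PER_CELL_STEP = 2800   # SL4TH3, from the slides
VARS_PER_CELL = 10
BYTES_PER_VALUE = 8


def required_bf(nt):
    """Bytes moved per flop if data is read and written once per nt steps."""
    bytes_per_cell = 2 * VARS_PER_CELL * BYTES_PER_VALUE  # one read + one write
    return bytes_per_cell / (FLOPS_PER_CELL_STEP * nt)


# Required B/F values quoted on the slide, keyed by NT.
TABLE = {1: 0.058, 2: 0.029, 3: 0.020, 4: 0.015,
         5: 0.012, 6: 0.010, 7: 0.009, 8: 0.008}

model = {nt: required_bf(nt) for nt in TABLE}
for nt in TABLE:
    print(f"NT={nt}: model {model[nt]:.4f}, slide {TABLE[nt]:.3f}")
```

The model falls slightly below the slide's values at every NT, as expected for an estimate that leaves out halo traffic, but it reproduces the roughly 1/NT decrease. From NT = 2 upward the required B/F drops below the 0.03 that PEZY-SC2 provides.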
Comparison with Other Studies
- This work achieves very high efficiency
- The result is comparable to those obtained on a very high B/F (= 0.5) system
[Figure: efficiency (%) and device B/F for Yashiro et al., Yang et al., Hotta et al., and this work]
Weak Scaling
- Communication is completely hidden by computation
- Thus, even though the actual communication time increases as the number of nodes grows, weak scaling of the performance is very good
[Figure: weak scaling, showing communication time and total time]
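Why hidden communication gives flat weak scaling can be seen in a toy timing model. The model and every number in it are hypothetical and do not come from the measurements in the talk: with overlap, a step costs the larger of compute and communication time; without overlap, it costs their sum.

```python
def step_time(compute, comm, overlapped=True):
    """Time per step when communication is (or is not) hidden by computation."""
    return max(compute, comm) if overlapped else compute + comm


def weak_scaling_efficiency(compute, comm_times, overlapped=True):
    """Efficiency relative to the first configuration, at fixed work per node."""
    base = step_time(compute, comm_times[0], overlapped)
    return [base / step_time(compute, c, overlapped) for c in comm_times]


# Hypothetical times (arbitrary units): communication grows with node count
# but stays below the per-node compute time.
compute_time = 1.0
comm_times = [0.2, 0.4, 0.8]

with_overlap = weak_scaling_efficiency(compute_time, comm_times, overlapped=True)
without_overlap = weak_scaling_efficiency(compute_time, comm_times, overlapped=False)
```

As long as communication time stays below compute time, the overlapped efficiency stays at 1.0, while the non-overlapped efficiency drops as communication grows.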
Future Work
- Other schemes
  - HLLD
- Other applications
  - Tsunami (shallow-water equations)
  - Reaction-diffusion systems
- Further performance improvement
Conclusion
- We achieved 4.78 PFlops, 21.5% of peak performance, with a fluid simulation code on a large-scale PEZY-SC2-based system
- We developed an automatic code generation framework for TB, a scheme suited to it, and a backend for the PEZY-SC2 accelerator
- The achieved efficiency is comparable to that of other work on high-B/F systems

ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems

  • 1. Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems. ESPM2 2018, Nov. 12th, Dallas. Hideyuki Tanaka*, Youhei Ishihara, Ryo Sakamoto, Takashi Nakamura, Yasuyuki Kimura, Keigo Nitadori, Miyuki Tsubouchi, Jun Makino
  • 2. Abstract
      • For an explicit finite-difference scheme applied to computational fluid dynamics, we achieved 4.78 PFlops, 21.5% of peak performance, by using temporal blocking on the large-scale PEZY-SC2-based system, which has a very low B/F
      • The achieved efficiency is comparable to recent works on very high B/F systems
      • To achieve this high efficiency on a low B/F machine, we developed
          • A framework for explicit stencil computation which generates the boilerplate code for MPI and the device kernel code with temporal blocking
          • A finite-difference scheme suitable for temporal blocking
  • 3. Table of Contents
      • Introduction
          • Explicit stencil computation
          • Temporal blocking
          • About PEZY-SC2
      • Details of our work
          • Code generation framework: Formura
          • Optimization for PEZY-SC2
      • Benchmark results
          • Performance on large-scale systems (Gyoukou)
      • Discussion and summary
  • 5. Explicit Stencil Computation
      • Explicit stencil computation is a simple but very important HPC application
      • It is used for simulating weather, earthquakes, the interior of the sun, etc.
      • Optimizing stencil computation is very important
      [Figure source: Riken]
  • 6. Efficiency of Recent Stencil Computation
      • The efficiency of explicit methods on recent HPC hardware is not high enough
          • Even the best-case efficiency on the K computer is ≅ 20%; many other cases reach only ≅ 10%
      • This low efficiency is caused by limitations of either the processor architecture or the memory bandwidth
      • We address the memory-bandwidth problem, which does not depend on the processor architecture
          • Over the past decades, the B/F (bytes per flop) of HPC systems has fallen dramatically
          • This trend seems likely to continue
  • 7. Relative Performance Trend
      • Green: FLOPS vs memory bandwidth (4.5x/decade)
      • Red: FLOPS vs network latency (~30x/decade)
      • This trend seems likely to continue
      [Figure source: John D. McCalpin]
  • 8. PEZY-SC2
      • Many-core MIMD processor
      • 1984 individual RISC cores
      • 2.8 TFlops peak DP performance (@700 MHz)
      • 4ch DDR4 DRAM, 64 GB, 80 GB/s
      • ⇒ B/F ≒ 0.03
      • cf. K computer: 0.5, Tesla V100: 0.12, TaihuLight: 0.04
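The B/F figure on this slide is the memory bandwidth divided by the peak floating-point rate. A minimal sketch of that arithmetic, using only the two numbers quoted above (the function and constant names are illustrative, not from the slides):

```python
# Device B/F = memory bandwidth (bytes/s) / peak floating-point rate (flop/s).
# Both inputs are the figures quoted on the slide.
PEAK_FLOPS = 2.8e12      # 2.8 TFlops peak double precision at 700 MHz
MEM_BANDWIDTH = 80e9     # 80 GB/s from 4-channel DDR4


def bytes_per_flop(bandwidth_bytes_per_s, peak_flops):
    """Return how many bytes of DRAM traffic are available per flop."""
    return bandwidth_bytes_per_s / peak_flops


pezy_sc2_bf = bytes_per_flop(MEM_BANDWIDTH, PEAK_FLOPS)
print(f"PEZY-SC2 B/F = {pezy_sc2_bf:.4f}")
```

The result rounds to the 0.03 shown on the slide, roughly a seventeenth of the 0.5 quoted for the K computer.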
  • 9. PEZY-SC2 Architecture
      • The chip consists of 8 prefectures
      • Each prefecture contains 16 cities
      • Each city contains 16 processor elements (PEs)
      • 8×16×16 - 64 (redundancy) = 1984 PEs
      • Each city shares an L2 cache
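The PE count follows directly from the hierarchy listed above. The sketch below also cross-checks the quoted 2.8 TFlops peak, under the assumption (not stated on the slides) that each PE retires one double-precision fused multiply-add, i.e. 2 flops, per cycle:

```python
# Hierarchy sizes as listed on the slide.
PREFECTURES = 8
CITIES_PER_PREFECTURE = 16
PES_PER_CITY = 16
REDUNDANT_PES = 64          # PEs held back for redundancy

physical_pes = PREFECTURES * CITIES_PER_PREFECTURE * PES_PER_CITY
usable_pes = physical_pes - REDUNDANT_PES

# ASSUMPTION (not from the slides): 2 DP flops per PE per cycle (one FMA).
FLOPS_PER_PE_PER_CYCLE = 2
CLOCK_HZ = 700e6            # 700 MHz, from the PEZY-SC2 slide

estimated_peak = usable_pes * CLOCK_HZ * FLOPS_PER_PE_PER_CYCLE
print(physical_pes, usable_pes, f"{estimated_peak / 1e12:.2f} TFlops")
```

Under that assumption the estimate lands within 1% of the 2.8 TFlops quoted for the chip.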
  • 10. Gyoukou
      • Supercomputer installed at JAMSTEC, Japan
          • (Available until April 2018)
      • Peak 28.2 PFlops (full nodes)
          • Top500 4th (Nov 2017)
      • 10000 PEZY-SC2s + 1250 Xeon D (one per 8 SC2s)
      • World's largest number of MIMD processor cores (≒ 20M, cf. TaihuLight ≒ 11M)
          • Suitable for testing whether code can scale to exascale systems
  • 12. Temporal Blocking (TB)
      • One solution for running explicit methods on low B/F systems
      • With TB, multiple time steps are computed on a working array before it is written back, which reduces the required B/F as long as the working array fits in the processor cache
      [Figure: data movement between DRAM, cache, and network; legend: computation for one time step, read from memory, write to memory]
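The idea can be illustrated with a toy 1D example. The sketch below is not the authors' scheme: it uses the simplest overlapped variant of temporal blocking (each block is loaded with a halo of NT cells per side and the halo is recomputed redundantly), a hypothetical 3-point averaging stencil, and periodic boundaries. It only shows that advancing a cache-sized block NT steps per load gives the same answer as NT full sweeps over memory.

```python
def step(u):
    """One explicit step of a 3-point stencil on an open interval.

    The result is two cells shorter than the input because the two edge
    cells have no complete neighbourhood.
    """
    return [(u[i - 1] + 2.0 * u[i] + u[i + 1]) / 4.0
            for i in range(1, len(u) - 1)]


def naive(u, nt):
    """Reference: nt full sweeps, each touching the whole array once."""
    for _ in range(nt):
        padded = [u[-1]] + u + [u[0]]      # periodic boundary
        u = step(padded)
    return u


def temporally_blocked(u, nt, block):
    """Advance nt steps while loading each block from 'memory' only once."""
    n = len(u)
    out = [0.0] * n
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Load the block plus a halo of nt cells on each side (periodic wrap).
        work = [u[i % n] for i in range(start - nt, stop + nt)]
        # Each step shrinks the working array by one cell per side, so after
        # nt steps exactly the cells start..stop-1 remain.
        for _ in range(nt):
            work = step(work)
        out[start:stop] = work
    return out


initial = [float((i * i) % 7) for i in range(32)]
reference = naive(initial, 4)
blocked = temporally_blocked(initial, 4, 8)
print(blocked == reference)
```

Each block is read and written once per NT steps instead of once per step, which is where the reduction in required B/F comes from; the price is the redundant halo computation.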
  • 13. Various Methods of TB
      [Figure: TB variants; one is annotated "A variation of this is used for inter-node communication" and another "This is used for in-node computation"]
  • 14. Detail of Our TB Calculation
      • Inter-node communication
          • Each node sends data in one direction
          • Each node receives data from one direction
          • Simple communication-computation overlapping
  • 15. Details of Our TB calculation
      • Computation starts from the right-most block
      • The upper-right part of the parallelogram uses dummy data so that all loops have the same length
          • The gray part contains unneeded results
          • This method adds only a small amount of computation
  • 16. SL4TH3 Scheme
      • Fourth-order accuracy
      • Number of stencil = 2
      • Flops per cell per step ≒ 2800
      • Required B/F ≒ 0.05 without TB
  • 18. Formura: a Framework for TB
      • From a stencil description written in the Formura DSL, Formura generates optimized distributed-parallel code for large-scale parallel computers
      • In this work, we added support for the TB method and developed a device kernel code generator for PEZY-SC2
      [Figure: Formura DSL → formura → MPI driver code and TB kernel code → gcc/mpicc → executable]
  • 19. Code generation by Formura
      [Figure: input equation (some additional configuration files are needed); a very small part of the generated C code; output (zoomed in)]
  • 20. Code Generation
      • Formura generates:
          • Driver code that distributes TB over MPI
          • Optimized kernel code for node-local computation
      • For a new accelerator (or any other processor), a backend can be added by modifying the code that calculates the temporal blocking steps
          • Typically, the major device-specific optimizations are the blocking layout (for data-access locality) and thread scheduling
  • 21. Optimization for PEZY-SC2
      • Decide the block size
          • The block must be smaller than the LLC
          • Parallelism should be close to the number of PEs for load balancing
          • 44×44×44 is the best block size
          • 44³ × 10 (variables/cell) × 8 Byte = 6.50 MB
          • 6.50 MB × 2 (for overlapping read and write) < 32 MB = LLC size
          • 44² (parallelism) = 1936 ≒ 1984 (number of PEs)
          • Total of 880³ (880 = 44×20) cells per node < 64 GB
      • Allocate adjacent cells to PEs that share an L2 cache
      • Decrease the instruction size of the inner-most loop
          • PEZY-SC2's L1 I-cache size = 4 KB (= 1024 ops)
          • SL4TH3 requires 2800 ops/cell
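The sizing constraints above can be re-derived in a few lines. This sketch uses only numbers from the slides; the variable names are illustrative, and it assumes the slide's "6.50 MB" and "32 MB" are binary megabytes (2^20 bytes), which is what reproduces the 6.50 figure.

```python
MIB = 2 ** 20

BLOCK_EDGE = 44
VARIABLES_PER_CELL = 10
BYTES_PER_VALUE = 8                 # double precision
LLC_MIB = 32                        # last-level cache size from the slide
NUM_PES = 1984
DRAM_BYTES = 64 * 10 ** 9           # 64 GB per PEZY-SC2, read as decimal GB
ICACHE_OPS = 1024                   # 4 KB L1 I-cache = 1024 instructions
OPS_PER_CELL = 2800                 # SL4TH3 cost per cell

bytes_per_cell = VARIABLES_PER_CELL * BYTES_PER_VALUE
block_mib = BLOCK_EDGE ** 3 * bytes_per_cell / MIB
double_buffered_mib = 2 * block_mib           # one buffer read, one written
parallelism = BLOCK_EDGE ** 2                 # one 44-cell column per PE
node_edge = BLOCK_EDGE * 20
node_bytes = node_edge ** 3 * bytes_per_cell

print(f"block: {block_mib:.2f} MiB, double-buffered: {double_buffered_mib:.2f} MiB")
print(f"fits in LLC: {double_buffered_mib < LLC_MIB}")
print(f"parallelism: {parallelism} of {NUM_PES} PEs busy")
print(f"node working set: {node_bytes / 1e9:.1f} GB, fits in DRAM: {node_bytes < DRAM_BYTES}")
print(f"kernel must be split to fit I-cache: {OPS_PER_CELL > ICACHE_OPS}")
```

The last check is the reason for the "decrease inner-most loop instruction size" item: a 2800-op cell update cannot sit in a 1024-op instruction cache as a single loop body.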
  • 23. Benchmark Results
      • Conditions
          • SL4TH3 scheme
          • Optimized backend for PEZY-SC2
          • 8000 PEZY-SC2s (20×20×20 layout) on Gyoukou
          • ≒ 16M cores
          • Total of 17600³ cells (17600 = 880×20)
      • Performance results
          • 4.78 PFlops
          • 21.5% efficiency (22.2 PFlops theoretical peak)
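The headline numbers on this slide are mutually consistent, as the short check below shows. All inputs are taken from the slides; only the variable names are invented here.

```python
CHIPS_PER_EDGE = 20
CELLS_PER_CHIP_EDGE = 880
PES_PER_CHIP = 1984
ACHIEVED_PFLOPS = 4.78
PEAK_PFLOPS = 22.2          # theoretical peak of the 8000-chip partition

num_chips = CHIPS_PER_EDGE ** 3
total_cores = num_chips * PES_PER_CHIP
cells_per_edge = CELLS_PER_CHIP_EDGE * CHIPS_PER_EDGE
total_cells = cells_per_edge ** 3
efficiency_percent = 100.0 * ACHIEVED_PFLOPS / PEAK_PFLOPS

print(f"chips: {num_chips}, cores: {total_cores / 1e6:.1f}M")
print(f"grid: {cells_per_edge}^3 = {total_cells:.3e} cells")
print(f"efficiency: {efficiency_percent:.1f}%")
```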
  • 24. Effect of Temporal Blocking
      Required B/F by NT (size per node = 880³):
      NT   Redundant calculation by TB   Required B/F
      1    1.4%                          0.058
      2    2.7%                          0.029
      3    4.0%                          0.020
      4    5.4%                          0.015
      5    6.7%                          0.012
      6    8.0%                          0.010
      7    9.2%                          0.009
      8    10.5%                         0.008
      [Figure: calculation speed (GF) by NT (= time step parameter)]
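The slides do not give the formula behind the "Required B/F" column. One simple model that agrees with every row to three decimals is sketched below. It rests on two assumptions that are not in the source: each cell's 10 double-precision variables are read once and written once per block sweep, and the traffic grows in proportion to the redundant-calculation fraction.

```python
VARIABLES_PER_CELL = 10
BYTES_PER_VALUE = 8
FLOPS_PER_CELL_STEP = 2800      # SL4TH3, from the slides

# ASSUMPTION: one read plus one write of every variable per block sweep.
BYTES_PER_CELL_SWEEP = VARIABLES_PER_CELL * BYTES_PER_VALUE * 2

# (redundant fraction, required B/F) per NT, copied from the slide's table.
SLIDE_TABLE = {
    1: (0.014, 0.058), 2: (0.027, 0.029), 3: (0.040, 0.020), 4: (0.054, 0.015),
    5: (0.067, 0.012), 6: (0.080, 0.010), 7: (0.092, 0.009), 8: (0.105, 0.008),
}


def required_bf(nt, redundant_fraction):
    """Bytes moved per useful flop when nt steps are fused per sweep."""
    # ASSUMPTION: redundant cells add traffic but no useful flops.
    traffic = BYTES_PER_CELL_SWEEP * (1.0 + redundant_fraction)
    return traffic / (FLOPS_PER_CELL_STEP * nt)


model = {nt: required_bf(nt, red) for nt, (red, _) in SLIDE_TABLE.items()}
for nt, (red, slide_bf) in SLIDE_TABLE.items():
    print(f"NT={nt}  model={model[nt]:.3f}  slide={slide_bf:.3f}")
```

Under this model the required B/F drops below the device's 0.03 from NT = 2 onward, and each further increase in NT buys less while adding more redundant work.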
  • 25. Comparison with Other Studies
      • This work achieves very high efficiency
          • The result is comparable to those obtained on a very high B/F (= 0.5) system
      [Figure: efficiency (%) and device B/F for Yashiro et al., Yang et al., Hotta et al., and this work]
  • 26. Weak Scaling
      • The communication is completely hidden by computation
      • Thus, even though the actual communication time increases as the number of nodes increases, the weak scaling of the performance is good
      [Figure: communication time and total time]
  • 28. Future Works
      • Other schemes
          • HLLD
      • Other applications
          • Tsunami (shallow-water equation)
          • Reaction-diffusion system
      • Further performance improvement
  • 29. Conclusion
      • We achieved 4.78 PFlops, 21.5% of peak performance, with a fluid simulation code on the large-scale PEZY-SC2-based system
      • We developed an automatic code generation framework for TB, a scheme suitable for it, and a backend for the PEZY-SC2 accelerator
      • Our achieved efficiency is comparable to other works on high B/F systems