Building Affordable and
Programmable Exascale Capable
Computers
SURFsara, 18 April 2019
Think Ångströms not nanometers in 2019
Ideally, we should steer the movement of almost every individual electron to solve our problems
0.1 nm → 1 Å
14 nm → 140 Å
[Figure: length scale from 1 Å to 10^4 Å (1 Å, 10 Å, 10^2 Å, 10^3 Å, 10^4 Å), marking the C-C bond, glucose, DNA, hemoglobin, ribosome, …, cells, and the light microscope resolution]
100s of Si atoms in 14 nm
Very few atoms at smaller scales (e.g., 3 nm / 30 Å → 6 to 12 atoms)
The power challenge
Moving data on-chip will use as much energy as computing with it
Moving data off-chip will use 200x more energy, and is much slower as well

Operation                         Today*      2020
Double precision Float Op         ~20 pJ      <10 pJ
Moving data on-chip: 1 mm         6 pJ
Moving data on-chip: 20 mm        120 pJ
Moving data to off-chip memory    4,000 pJ    2,000 pJ
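The 200x figure can be checked directly against the table. The sketch below is only a back-of-the-envelope calculation: it assumes the first value listed in each row and treats every entry as the energy of a single event. The dictionary keys and the function name are illustrative, not from the slides.

```python
# Energies in picojoules, taken from the first value listed in each table row.
ENERGY_PJ = {
    "dp_flop": 20.0,        # double precision floating-point operation (~20 pJ)
    "on_chip_1mm": 6.0,     # moving data 1 mm on-chip
    "on_chip_20mm": 120.0,  # moving data 20 mm on-chip
    "off_chip": 4000.0,     # moving data to off-chip memory
}


def relative_cost(kind: str, baseline: str = "dp_flop") -> float:
    """Energy of one `kind` event expressed in multiples of `baseline`."""
    return ENERGY_PJ[kind] / ENERGY_PJ[baseline]


off_chip_ratio = relative_cost("off_chip")        # 4000 / 20
cross_chip_ratio = relative_cost("on_chip_20mm")  # 120 / 20
print(f"off-chip move = {off_chip_ratio:.0f}x a DP flop")
print(f"20 mm on-chip = {cross_chip_ratio:.0f}x a DP flop")
```

Under these assumptions an off-chip access costs 200 floating-point operations, and even a 20 mm on-chip move costs several.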
The data movement challenge
Next-generation computer systems should take data movement at all levels very seriously
Data movement challenge confirmed
Wires that carry the data (and instructions) become more and more important
(Courtesy: NVIDIA and ITRS)
BUT
Why is all of this important?
“… without dramatic increases in efficiency, ICT industry could use 20% of all electricity and emit up to 5.5% of the world’s carbon emissions by 2025.”
“We have a tsunami of data approaching. Everything which can be is being digitalised. It is a perfect storm.”
“… a single $1bn Apple data centre planned for Athenry in Co Galway, expects to eventually use 300MW of electricity, or over 8% of the national capacity and more than the daily entire usage of Dublin. It will require 144 large diesel generators as back up for when the wind does not blow.”
Programming a Dataflow “mass production” Engine
Computing in Time:
Follow a recipe step by step, one at a time
Computing in Space:
Build a “recipe specific” factory with multiple paths performed simultaneously
One result per clock cycle
Efficient, predictable, reliable “mass production” of huge data amounts
Build Computers for your Problem and Data
1. Describe Conjugate Gradient as dataflow graph
2. Compile dataflow structure and load to hardware
3. Stream data through the Custom Accelerator
Create customized mega accelerators with massive inherent throughputs
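The difference between computing in time and computing in space can be illustrated with a toy software model. This is only a sketch under simplifying assumptions: every stage takes exactly one cycle, and the two-stage "recipe" (square, then add 30) and all function names are chosen for illustration. It is not how a DFE is programmed.

```python
from collections import deque


def run_in_time(xs, stages):
    """One unit performs every step of the recipe in turn for each input."""
    out, cycles = [], 0
    for x in xs:
        for stage in stages:
            x = stage(x)
            cycles += 1          # each step occupies the single unit for a cycle
        out.append(x)
    return out, cycles


def run_in_space(xs, stages):
    """Each stage is its own unit; all units work in every cycle (a pipeline)."""
    regs = [None] * len(stages)  # one register holding each stage's output
    pending = deque(xs)
    out, cycles = [], 0
    while pending or any(r is not None for r in regs):
        if regs[-1] is not None:
            out.append(regs[-1])             # a finished result leaves the pipeline
        for i in range(len(stages) - 1, 0, -1):
            prev = regs[i - 1]
            regs[i] = stages[i](prev) if prev is not None else None
        regs[0] = stages[0](pending.popleft()) if pending else None
        cycles += 1
    return out, cycles


recipe = [lambda v: v * v, lambda v: v + 30]
time_out, time_cycles = run_in_time([1, 2, 3, 4], recipe)
space_out, space_cycles = run_in_space([1, 2, 3, 4], recipe)
print(time_out, time_cycles)
print(space_out, space_cycles)
```

In this model the sequential version needs (number of inputs) x (number of stages) cycles, while the pipeline needs the number of inputs plus the pipeline depth, so for long streams it approaches one result per clock cycle.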
Solving Computing Problems Vertically
[Figure: the computing stack with the layers Problem, Algorithm, Program in HL Language, Machine Architecture, Implementation, Circuits and Devices, with "Solutions" marked on the stack]
Co-optimise the HW and the SW stack for the performance critical areas of the application
From Equations to Dataflow Hardware
[Equation garbled in extraction; the surviving symbols are u, x, s, v, d, F, a, h, p, T, R, t and ln]
Real data flow graph as generated by MaxCompiler: 4,866 nodes; 10,000s of stages/cycles
Full Customization in: Space, Value and Time (SVT)
Easy it is not (and not really new)
Slotnick’s law (of effort):
“The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage.”
Daniel Slotnick (1931-1985)
Chief Architect of Illiac IV
Programming in Space basics
• Control and Data-flows are decoupled
– Both are fully programmable
• Operations exist in space and by default run in parallel
– Their number is limited only by the available space
• All operations can be customized at various levels
– e.g., from algorithm down to the number representation
• Multiple operations constitute kernels
• Data streams through the operations / kernels
• The data transport and processing can be balanced
• All resources work all of the time for max performance
• The In/Out data rates determine the operating frequency
Equally spread the available “forces” and move no faster than required by the application
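The point that the In/Out data rates determine the operating frequency can be made concrete with a small calculation. This is a sketch only: the function name and the example numbers (a 3.2 GB/s stream of 4-byte values, four values consumed per cycle) are hypothetical and do not come from the slides.

```python
def required_clock_hz(stream_bytes_per_s: float,
                      bytes_per_element: int,
                      elements_per_cycle: int) -> float:
    """Clock rate at which a kernel consumes data exactly as fast as it arrives."""
    bytes_per_cycle = bytes_per_element * elements_per_cycle
    return stream_bytes_per_s / bytes_per_cycle


# Hypothetical stream: 3.2 GB/s of 4-byte values, 4 values per cycle.
clock_hz = required_clock_hz(3.2e9, 4, 4)
print(f"{clock_hz / 1e6:.0f} MHz")
```

Running faster than this rate would leave the arithmetic waiting for data; widening the datapath (more elements per cycle) lowers the clock needed for the same throughput.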
The Computational Model
• Dataflow sub-system (DataFlow Engine, DFE)
– Spatial arithmetic chip “hardware” technology with flexible arithmetic units and programmable interconnect (resembles FPGAs but is not limited to them)
– Programmable Static Dataflow
– Systolic Execution at kernel level
– Streaming Custom Computing at system level
– Implicit GALS* IO and kernel-to-kernel communication
• Dedicated software (MaxCompiler, MaxelerOS and SLiC)
– Compilation toolchain and design methodology
– Incorporated simulation and debug environment for rapid development
– Runtime system fully integrated with Linux, plus low-level software support
– Helps the designer focus on the data/algorithm and the system architecture
• Only three basic memory types (explicitly exposed)
– Scalars (exposed to the CPU)
– Fast Memory (FMEM): small and fast (on-chip)
– Large Memory (LMEM): large and slow (off-chip)
* GALS – Globally Asynchronous Locally Synchronous
Maxeler’s DataFlow Engine (DFE, MAX4)
[Figure: block diagram of a Dataflow Engine (DFE) with these labels: reconfigurable compute fabric; dataflow cores & FMEM (Fast Memory); high bandwidth memory link; LMEM (Large Memory), 4-96GB; MaxRing links and MaxRing interconnect; link to main data network (e.g., PCIe, Infiniband)]
• 48GB DRAM (LMEM)
• Stratix V D8
• MaxRing interconnect
• 4,000 multipliers
• 700K logic cells
• 6.25MB of FMEM
Application Level Components
[Figure: a CPU and a DFE, each with its own memory, connected by PCI Express or Infiniband; SLiC and MaxelerOS sit between the host application and the DFE]
• Host application (C, Python, Matlab..)
• Kernels (MaxJ): instantiate the arithmetic structure
• Manager (MaxJ): arrange the data orchestration
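MaxJ: Moving Average of three numbers
Dataflow computing in hardware using a language you know
Simple example: y = x^2 + 30
[Figure: dataflow graph in which x is multiplied by itself, 30 is added, and the result is output as y]
    // Input stream "x" with a 10-bit exponent and a 31-bit mantissa
    DFEVar x = io.input("x", dfeFloat(10,31));
    // One multiplier and one adder are instantiated in space
    DFEVar result = x * x + 30;
    // Output stream "y" uses the same number format
    io.output("y", result, dfeFloat(10,31));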
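For readers without access to MaxCompiler, the behaviour of this kernel can be mimicked in plain software. The sketch below is a functional model only, with an illustrative function name. It uses ordinary Python floats, so it does not reproduce the custom dfeFloat(10,31) number format or any hardware timing.

```python
from typing import Iterable, Iterator


def square_plus_30(xs: Iterable[float]) -> Iterator[float]:
    """Functional model of the kernel: one output element per input element."""
    for x in xs:
        yield x * x + 30


ys = list(square_plus_30([0.0, 1.0, 2.5]))
print(ys)
```

The generator mirrors the streaming view of the kernel: values are consumed from an input stream and produced on an output stream, with no indexing or explicit loop over memory in the kernel description itself.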
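MaxJ example: Control in Space
[Figure: dataflow graph in which x feeds x + 1, x - 1 and the comparison x > 10; the comparison selects which result is output as y]
class SimpleKernel extends Kernel {
    SimpleKernel(KernelParameters parameters) {
        super(parameters);
        DFEVar x = io.input("x", dfeInt(24));
        // Both x+1 and x-1 exist in space; the comparison selects one of them
        DFEVar result = (x > 10) ? x + 1 : x - 1;
        io.output("y", result, dfeInt(25));
    }
}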
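The idea behind control in space is that both branches of the conditional are built and evaluated for every input, and the comparison only selects which result is forwarded. A minimal software model of that behaviour, with illustrative names and no modelling of the 24/25-bit integer widths, could look like this:

```python
def mux(select: bool, if_true: int, if_false: int) -> int:
    """Two-input multiplexer: forwards one of two already-computed values."""
    return if_true if select else if_false


def simple_kernel(xs):
    """Functional model of SimpleKernel: no branch is skipped, one is selected."""
    for x in xs:
        plus_one = x + 1       # always computed
        minus_one = x - 1      # always computed
        greater = x > 10       # comparison drives the multiplexer
        yield mux(greater, plus_one, minus_one)


ys = list(simple_kernel([9, 10, 11]))
print(ys)
```

Because nothing is skipped, every input takes the same path length through the graph, which is what keeps the throughput at one result per cycle regardless of the data.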
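Non Traditional Design Process
[Figure: design loop with the stages ANALYSE, ARCHITECT, PROGRAM, GENERATE DATAFLOW, and SIMULATE AND DEBUG, an "OK?" check, and a build step of many hours that yields the custom HW]
Used to build real systems; however, very difficult to learn and to teach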
Optimizations at all levels

Multiple scales of computing    Important features for optimization
complete system level           balance compute, storage and IO
parallel node level             maximize utilization of compute and interconnect
microarchitecture level         minimize data movement
arithmetic level                tradeoff range, precision and accuracy = discretize in Time, Space and Value
bit level                       encode and add redundancy
transistor level                manipulate ‘0’ and ‘1’

and more, e.g., trade/hide Communication (Time) for/behind Computation (Space)
[Figure axes: Flow/Time, Space]
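The arithmetic-level tradeoff between range and precision can be illustrated with a signed fixed-point format. The sketch is an assumption-laden example rather than anything from the slides: the function name, the choice of saturation on overflow, and the bit widths are all illustrative.

```python
def quantize(value: float, int_bits: int, frac_bits: int) -> float:
    """Round `value` to a signed fixed-point format and return the stored value.

    int_bits counts the bits left of the binary point (sign included),
    frac_bits counts the bits right of it.
    """
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    raw_min = -(1 << (total_bits - 1))
    raw_max = (1 << (total_bits - 1)) - 1
    raw = round(value * scale)
    raw = max(raw_min, min(raw_max, raw))  # saturate instead of wrapping
    return raw / scale


coarse = quantize(3.14159, 4, 4)    # 8 bits in total
fine = quantize(3.14159, 4, 12)     # 16 bits in total, same range
clipped = quantize(100.0, 4, 4)     # value outside the representable range
print(coarse, fine, clipped)
```

More fractional bits reduce the rounding error at the cost of area, while too few integer bits clip large values. Choosing both per variable is one of the customizations a spatial design allows.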
Some of the challenges
1. Higher chip / system price compared to microprocessors
2. Long design lead times (3 months in the best case)
a. Complex numerical transformations
b. Non-trivial area and data movement optimizations
3. “Painful” Place & Route times (12 hours to 24 hours)
a. Expensive vendor-specific tools
b. Serious development calls for dedicated build clusters
4. Need to compete at 200MHz with processors at 3GHz
5. Current HW technology is sub-optimal
a. On-chip memory not built for stream processing
b. On-chip interconnect overdesigned for Dataflow
6. Long learning curve (Tools and Methods needed)
7. Designer’s productivity should improve (Tools and Methods)
8. …
Ongoing effort on improved methodologies and tools
Multiple platforms, single DFE abstraction
[Figure: the same DFE block diagram (reconfigurable compute fabric with dataflow cores & FMEM, LMEM of 4-96GB, MaxRing links, link to main data network) shown for two hardware generations: gen4 (Intel based) and gen5 (Xilinx based)]
Application and MaxJ
Performance Portable Migration
Substrate Agnostic Compilation
• MaxCompiler generates VHDL ready for FPGA vendor tools
• Synthesis transforms VHDL into logical “netlist”: sets of basic logic expressions
• Map fits basic logic into N-input look-up tables
• Place puts LUTs, DSPs, RAMs etc at specific locations on chip
• Route sets up wiring between blocks
[Figure: flow chart with the steps MaxCompiler compilation, Synthesis, Map, Place, Route and Generate Maxfile, and the intermediate results VHDL, Netlist, LUTs, Placed FPGA and Complete FPGA]
DFE Place and Route example
Mon 16:27: MaxCompiler version: 2012.2
Mon 16:27: Build "MyKernel" start time: Mon Apr 08 16:27:24 BST 2013
Mon 16:27: Main build process running as user training1 on host Maxworkstation7478
Mon 16:27: Build location: /home/training1/maxcompiler-builds/MyKernel
Mon 16:27: Instantiating manager
Mon 16:27: Instantiating kernel "MyKernel"
Mon 16:27: Compiling manager (CPU I/O Only)
Mon 16:27: Compiling kernel "MyKernel"
Mon 16:27: Generating input files (VHDL, netlists, CoreGen)
Mon 16:27: Running back-end build (12 phases)
Mon 16:27: (1/12) - Prepare MaxFile Data (GenerateMaxFileDataFile)
Mon 16:27: (2/12) - Synthesize DFE Modules (XST)
Mon 16:30: (3/12) - Link DFE Modules (NGCBuild)
Mon 16:30: (4/12) - Prepare for Resource Analysis (EDIF2MxruBuildPass)
Mon 16:30: (5/12) - Generate Preliminary Annotated Source Code
Mon 16:30: (6/12) - Report Resource Usage (ResourceCounter)
Mon 16:30: About to start chip vendor Map/Place/Route toolflow. This will take some time.
Mon 16:30: (7/12) - Prepare for Placement (NGDBuild)
Mon 16:30: (8/12) - Place and Route DFE (MPPR)
Mon 16:30: Executing MPPR with 1 cost tables and 1 threads.
Mon 16:30: MPPR: Starting 1 cost table
Mon 16:43: MPPR: Cost table 1 met timing with score 0 (best score 0)
Mon 16:43: (9/12) - Prepare for Resource Analysis (XDLBuild)
Mon 16:44: (10/12) - Generate Resource Report (ResourceUsageBuildPass)
Mon 16:44: (11/12) - Generate Annotated Source Code (ResourceAnnotationBuildPass)
Mon 16:44: (12/12) - Generate MaxFile (GenerateMaxFile)
Mon 16:45:
Mon 16:45: FINAL RESOURCE USAGE
Mon 16:45: LUTs: 9503 / 149760 (6.35%)
Mon 16:45: FFs: 12749 / 149760 (8.51%)
Mon 16:45: BRAMs: 34 / 516 (6.59%)
Mon 16:45: DSPs: 0 / 1056 (0.00%)
Mon 16:45:
Mon 16:45: MaxFile: /home/training1/maxcompiler-builds/MyKernel/results/MyKernel.max
(MD5Sum: e564cd922aeeda04acfa2f4ecce8236d)
Mon 16:45: Build completed: Mon Apr 08 16:45:58 BST 2013 (took 18 mins, 33 secs)
FPGA vendor specific back-end tool flow, abstracted by MaxCompiler
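The percentages in the FINAL RESOURCE USAGE section of the log are simply used/available ratios, which makes them easy to re-derive when comparing builds. A small sketch using the figures from the log above (the dictionary layout and function name are illustrative):

```python
# (used, available) pairs copied from the FINAL RESOURCE USAGE lines of the log.
RESOURCE_USAGE = {
    "LUTs": (9503, 149760),
    "FFs": (12749, 149760),
    "BRAMs": (34, 516),
    "DSPs": (0, 1056),
}


def utilisation_percent(used: int, available: int) -> float:
    """Share of a resource type that the design occupies, in percent."""
    return round(100.0 * used / available, 2)


report = {name: utilisation_percent(*pair) for name, pair in RESOURCE_USAGE.items()}
for name, percent in report.items():
    print(f"{name}: {percent:.2f}%")
```

For this example design every resource type stays below 10% of the chip.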
DFE Resource Usage Reporting
• Allows you to see what lines of code are using what resources and focus optimization
• Separate reports for each kernel and for the manager
LUTs FFs BRAMs DSPs : MyKernel.java
727 871 1.0 2 : resources used by this file
0.24% 0.15% 0.09% 0.10% : % of available
71.41% 61.82% 100.00% 100.00% : % of total used
94.29% 97.21% 100.00% 100.00% : % of user resources
:
: public class MyKernel extends Kernel {
: public MyKernel (KernelParameters parameters) {
: super(parameters);
1 31 0.0 0 : DFEVar p = io.input("p", dfeFloat(8,24));
2 9 0.0 0 : DFEVar q = io.input("q", dfeUInt(8));
: DFEVar offset = io.scalarInput("offset", dfeUInt(8));
8 8 0.0 0 : DFEVar addr = offset + q;
18 40 1.0 0 : DFEVar v = mem.romMapped("table", addr,
: dfeFloat(8,24), 256);
139 145 0.0 2 : p = p * p;
401 541 0.0 0 : p = p + v;
: io.output("r", p, dfeFloat(8,24));
: }
: }
[Figure: chip layout showing DSP Blocks, Block RAMs, IO Blocks and LUT/FFs]
Different operations use different resources
Optimization Feedback
• MaxCompiler gives detailed latency and area annotation back to the programmer
• Evaluate precise effect of code on latency and chip area
12.8 ns + 6.4 ns = 19.2 ns (total compute latency)
Pilot System Deployed at Jülich
http://www.prace-ri.eu/pcp/
Small pilot system deployed in Oct 2017
• one 1U MPC-X with 8 MAX5 DFEs
• one 1U AMD EPYC based server
• one 1U login head node
Scaling using Amazon AWS cloud
• MAX5 fully compatible with F1 instances
• Elastic scaling between private and public
[Figure: system diagram with remote users, a head/build node, an MPC-X node (MAX5 DFEs) and a Supermicro EPYC node (EPYC CPU, 1TB DDR4); links labelled ipmi, 10 Gbps, 56 Gbps and 2x Infiniband @ 56 Gbps]
PRACE-PCP: SpecFEM3D on DFE
The BQCD Chip - AERIAL VIEW
Scalable Conjugate Gradient Design for the CG step of BQCD
Achieved Performance PRACE workloads

Application  Problem (Small/Large)   System (composition) (size)                        TTS [sec]  ETS [kWh]  DTS (F1)
BQCD         32x32x32x32             PRACE pilot (8 DFEs, 64 EPYC cores) (2U)           1,054      0.44       -
BQCD         64x64x64x64             1PF equivalent (48 DFEs, 512 EPYC cores) (14U)     1,703.8    4.26       $39.93
NEMO         GYRE6                   PRACE pilot (8 DFEs, 64 EPYC cores) (2U)           388        0.164      -
NEMO         GYRE144                 1PF equivalent (48 DFEs, 92 EPYC cores) (8U)       1,942      3.77       $42.72
SFM3D        1 chunk x64x64          PRACE pilot (8 DFEs, 64 EPYC cores) (2U)           232        0.096      -
SFM3D        6 chunks x1,440x1,440   1PF equivalent (384 DFEs, 768 EPYC cores) (60U)    5,150      70.1       $1,267.2
QE           Al2O3                   PRACE pilot (8 DFEs, 64 EPYC cores) (2U)           32         0.013      -
QE           Ta2O5                   1PF equivalent (64 DFEs, 64 EPYC cores) (9U)       3,210      7.58       $94.16
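One way to read the table is to derive the average power draw of each run. The sketch below assumes that TTS means time to solution in seconds and ETS means energy to solution in kWh; the slides do not expand these abbreviations. It uses the four PRACE pilot rows, and the function name is illustrative.

```python
def average_power_kw(tts_seconds: float, ets_kwh: float) -> float:
    """Average power in kW: energy (kWh converted to kWs) divided by run time."""
    return ets_kwh * 3600.0 / tts_seconds


# (TTS [sec], ETS [kWh]) for the PRACE pilot rows of the table.
PILOT_RUNS = {
    "BQCD": (1054, 0.44),
    "NEMO": (388, 0.164),
    "SFM3D": (232, 0.096),
    "QE": (32, 0.013),
}

pilot_power = {name: average_power_kw(t, e) for name, (t, e) in PILOT_RUNS.items()}
for name, kw in pilot_power.items():
    print(f"{name}: {kw:.2f} kW")
```

Under that reading, all four pilot runs come out at roughly 1.5 kW for the 2U system, which suggests that the power draw is largely independent of the workload.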
Global Weather Simulation with DFEs in China
⬥ L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X. Huang, Y. Zhang, and G. Yang, Accelerating solvers for global atmospheric equations through mixed-precision data flow engine, published at FPL 2013
⬥ Joint research with Imperial College and Tsinghua University
⬥ Simulating the atmosphere using the shallow water equation
An order of magnitude improvement over the Linpack-driven supercomputer technology

Platform          Speedup   Efficiency
6 Core CPU        1x        1x
Tianhe-1A Node    23x       15x
Maxeler MPC-X     330x      145x
Conclusions
• A (fancy) name does not help with solving the problem at hand
• Cloud, (Intelligent) Edge, Fog are just names like … Maria
• FPGA is just a technology that can help bridge the gap to something better (Spatial Computing Acceleration HW, Quantum Processing, …)
• Just focus on building the best computer for the given job
• Learn, think, pioneer and always stay critical
• Abstraction is powerful but quite often not needed
• Use it with great care and remember Dan Slotnick
• We are turning Earth into a heterogeneous, planet-wide computer
• So we should try not to kill it in the process
• There is a lot of interest in this topic
Questions?
Contact me at: georgi@maxeler.com
Or find me on Google: “Georgi computer” should do
Some links with more information
Maxeler Multiscale Dataflow Computing:
https://www.maxeler.com/technology/dataflow-computing/
Computing in Space explained by Mike Flynn:
http://www.openspl.org/what-is-openspl/
Computing in Space Course at Imperial College:
http://cc.doc.ic.ac.uk/openspl16/
Exciting Applications for DFEs (and JDFEs):
http://appgallery.maxeler.com
Maxeler DFEs on AWS EC2 F1:
https://aws.amazon.com/marketplace/seller-profile?id=2780c6ec-d326-47fc-9ff6-c66ab2ba202a
Maxeler and Xilinx Alveo collaboration:
https://www.xilinx.com/products/boards-and-kits/alveo.html
Maxeler Applications Gallery
Dataflow Engine (DFE) Ecosystem
⬥ With over 150 universities in our university program, we
decided to create an app gallery to enable the community
to share applications, examples, demos, …
⬥ The App Gallery is complemented by a teaching program, with the first successful course taught at Imperial College in 2014; see http://cc.doc.ic.ac.uk/openspl14
⬥ Top 10 APPS:
➢ Correlation: in real-time, pairwise, on 6,000 streams
➢ 100% Guaranteed Packet Capture
➢ Webserver, cache and load balancing
➢ HESTON Option pricer
➢ N-body simulation
➢ Regex matching (e.g. for Security)
➢ Brain network simulation
➢ Quantum Chromo-Dynamics kernel
➢ Seismic Imaging
➢ Realtime Classification
Dataflow Apps and Analytics for Machine Learning http://appgallery.maxeler.com/
Peer Reviewed Dataflow Publications
2008: Seismic Imaging with Dataflow Engines 25x faster, An Implementation of the Acoustic
Wave Equation, T. Nemeth et al, Chevron, Society of Exploration Geophysicists, Nov 2008.
2010: Credit Derivatives Valuation and Risk, from 8 hours to 2 minutes, American Finance
Technology Award, with JP Morgan.
2011: Modeling and Imaging with Schlumberger, Beyond Traditional Microprocessors for
Geoscience High-Performance Computing Applications, O. Lindtjörn et al, Schlumberger,
IEEE Micro, vol. 31, no. 2, March/April 2011.
2012: Weather Imaging with CRS4, 60x faster, Acceleration of a Meteorological Limited Area
Model with dataflow Engines, Diego Oriato†, Simon Tilbury†, Marino Marrocu§, Gabrielle
Pusceddu§ (†Maxeler, §CRS4), 2012 Symposium on Application Accelerators in HPC.
2013: Convergence of Risk and Trading in partnership with CME Group and birth of OpenSPL
industry standard (www.openspl.org), In Cloud Computing it’s the Era of Convergence, Open
Markets Magazine, Ari Studnitzer, CME Group.
2014: Brain Simulation with Erasmus, Real-Time Olivary Neuron Simulations on Dataflow
Computing Machines, Georgios Smaragdos, Craig Davies, Christos Strydis, Ioannis Sourdis,
Catalin Ciobanu, Oskar Mencer, and Chris I. De Zeeuw, Supercomputing; Springer, 487-497
2017: High Energy Physics with Imperial, Using MaxCompiler for the high level synthesis
of trigger algorithms, S. Summers, A. Rose and P. Sanders, Journal of Instrumentation,
Volume 12, IOP Publishing.
Maxeler University Program
Maxeler Trophy Cabinet
Academic History since 2005
• Imperial College Research Excellence Award
• Top EPSRC Advanced Fellowship
• Two Best Paper Awards
• Early Dataflow paper by Maxeler’s Founder has been recognized as one of the most
influential papers at the FPL conference in the last 25 years.
Recent Commercial Awards
• HPCwire Editors Choice Award, November 2011.
• American Finance Technology Awards, New York, winner,
“Most Cutting Edge IT Initiative,” December 2011.
• Golden Arrow, “...for revolutionizing Computers, ”
COM-SULT, January 2012.
• Gartner “Cool Vendor of the Year,” March 2012.
• Frost and Sullivan “Most innovative IT vendor, ” Dec 2013.
• CIO Review, 20 Most Promising Networking Companies, March 2014.
• CIO Review, 20 Most Promising HPC Companies, March 2015.
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 

Último (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 


  • 1. Building Affordable and Programmable Exascale Capable Computers SURFsara, 18 April 2019
  • 2. Think Ångströms, not nanometers, in 2019. Ideally we should steer the movement of almost every individual electron to solve our problems. 0.1 nm = 1 Å; 14 nm = 140 Å. A 14 nm feature holds 100s of Si atoms; at 3 nm (30 Å) very few atoms remain (6 to 12).
      [Figure: length scale from 1 Å to 10^4 Å marking the C-C bond, glucose, DNA, hemoglobin, ribosome, …, cells, and the resolution of the light microscope.]
  • 3. The power challenge. Moving data on-chip will use as much energy as computing with it. Moving data off-chip will use 200x more energy, and is much slower as well.
      Energy per operation (Today* / 2020):
      – Double-precision float op: ~20 pJ / <10 pJ
      – Moving data on-chip, 1 mm: 6 pJ
      – Moving data on-chip, 20 mm: 120 pJ
      – Moving data to off-chip memory: 4,000 pJ / 2,000 pJ
      The data movement challenge: next-generation computer systems should take data movement at all levels very seriously.
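The ratios behind this slide can be checked with a few lines of Python. This is only a worked-arithmetic sketch: the dictionary keys and the helper name `ratio_to_flop` are made up for illustration, and the values are the first figure listed for each row of the slide.

```python
# Energy per operation in picojoules (first value listed per row on the slide).
ENERGY_PJ = {
    "dp_flop": 20.0,        # double-precision floating-point operation (~20 pJ)
    "on_chip_1mm": 6.0,     # moving data 1 mm on-chip
    "on_chip_20mm": 120.0,  # moving data 20 mm on-chip
    "off_chip": 4000.0,     # moving data to off-chip memory
}


def ratio_to_flop(kind: str) -> float:
    """Energy of one data movement of `kind`, expressed in floating-point operations."""
    return ENERGY_PJ[kind] / ENERGY_PJ["dp_flop"]


for kind in ("on_chip_1mm", "on_chip_20mm", "off_chip"):
    print(f"{kind}: {ratio_to_flop(kind):.1f}x the energy of one flop")
```

With these numbers the off-chip transfer costs 200 times one floating-point operation, which matches the "200x" on the slide, and the 20 mm on-chip figure is exactly twenty times the 1 mm figure.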
  • 4. Data movement challenge confirmed. Wires that carry the data (and instructions) become more and more important. (Courtesy: NVIDIA and ITRS) BUT
  • 5. Why is all of this important?
      “… without dramatic increases in efficiency, ICT industry could use 20% of all electricity and emit up to 5.5% of the world’s carbon emissions by 2025.”
      “We have a tsunami of data approaching. Everything which can be is being digitalised. It is a perfect storm.”
      “… a single $1bn Apple data centre planned for Athenry in Co Galway, expects to eventually use 300MW of electricity, or over 8% of the national capacity and more than the daily entire usage of Dublin. It will require 144 large diesel generators as back up for when the wind does not blow.”
  • 6. Build Computers for your Problem and Data
      Computing in Time: follow a recipe step by step, one at a time.
      Computing in Space: build a “recipe specific” factory with multiple paths performed simultaneously; one result per clock cycle.
      Efficient, predictable, reliable “mass production” of huge data amounts.
  • 7. Programming a Dataflow “mass production” Engine
      1. Describe Conjugate Gradient as dataflow graph
      2. Compile dataflow structure and load to hardware
      3. Stream data through the Custom Accelerator
      Create customized mega accelerators with massive inherent throughputs.
  • 8. Solving Computing Problems Vertically. Co-optimise the HW and the SW stack for the performance-critical areas of the application.
      [Figure: vertical stack with the labels Problem, Algorithm, Program in HL Language, Machine Architecture, Implementation, Circuits, Devices, and Solutions.]
  • 9. From Equations to Dataflow Hardware
      [Equation shown on the slide; not recoverable from the extracted text.]
  • 10. Real data flow graph as generated by MaxCompiler: 4,866 nodes; 10,000s of stages/cycles. Full Customization in Space, Value and Time (SVT).
  • 11. Easy it is not (and not really new). Slotnick’s law (of effort): “The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage.” Daniel Slotnick (1931-1985), Chief Architect of Illiac IV
  • 12. Programming in Space basics
      • Control and data flows are decoupled
        – Both are fully programmable
      • Operations exist in space and by default run in parallel
        – Their number is limited only by the available space
      • All operations can be customized at various levels
        – e.g., from algorithm down to the number representation
      • Multiple operations constitute kernels
      • Data streams through the operations / kernels
      • The data transport and processing can be balanced
      • All resources work all of the time for max performance
      • The In/Out data rates determine the operating frequency
      Equally spread the available “forces” and move no faster than required by the application.
  • 13. The Computational Model
      • Dataflow sub-system (DataFlow Engine, DFE)
        – Spatial arithmetic chip “hardware” technology with flexible arithmetic units and programmable interconnect (looks like FPGAs but is not limited to them)
        – Programmable Static Dataflow
        – Systolic Execution at kernel level
        – Streaming Custom Computing at system level
        – Implicit GALS* IO and kernel-to-kernel communication
      • Dedicated software (MaxCompiler, MaxelerOS and SLiC)
        – Compilation toolchain and design methodology
        – Integrated simulation and debug environment for rapid development
        – Runtime system fully integrated with Linux, and low-level software support
        – Helps the designer focus on the data/algorithm and the system architecture
      • Only three basic memory types (explicitly exposed)
        – Scalars (exposed to the CPU)
        – Fast Memory (FMEM): small and fast (on-chip)
        – Large Memory (LMEM): large and slow (off-chip)
      * GALS: Globally Asynchronous Locally Synchronous
  • 14. Maxeler’s DataFlow Engine (DFE, MAX4)
      • 48GB DRAM (LMEM)
      • Stratix V D8
      • MaxRing interconnect
      • 4,000 multipliers
      • 700K logic cells
      • 6.25MB of FMEM
      [Figure: DFE block diagram: reconfigurable compute fabric with dataflow cores and FMEM (Fast Memory); high-bandwidth memory link to LMEM (Large Memory, 4-96GB); MaxRing links to the MaxRing interconnect; link to the main data network (e.g., PCIe, Infiniband).]
  • 15. Application Level Components
      • Host application (C, Python, Matlab, ...)
      • Manager (MaxJ): arranges the data orchestration
      • Kernels (MaxJ): instantiate the arithmetic structure
      [Figure: CPU with its memory, SLiC and MaxelerOS, connected over PCI Express or Infiniband to a DFE with its memory.]
  • 16. MaxJ: Moving Average of three numbers. Dataflow computing in hardware using a language you know.
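The MaxJ source for this slide is not part of the transcript. As a rough software model of what a three-point moving average computes over a stream, here is a plain Python sketch. The function name `moving_average3` is hypothetical, and the handling of the two ends of the stream (averaging only the values that exist) is an assumption, not something the slide states.

```python
from typing import List, Sequence


def moving_average3(xs: Sequence[float]) -> List[float]:
    """Centred three-point moving average over a finite stream.

    Assumption: at the first and last element, where one neighbour is
    missing, the average is taken over the values that do exist.
    """
    n = len(xs)
    out = []
    for i in range(n):
        # Window of the previous, current and next element, clipped to the stream.
        window = xs[max(0, i - 1): min(n, i + 2)]
        out.append(sum(window) / len(window))
    return out


print(moving_average3([1, 2, 3, 4, 5]))  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

In a dataflow engine the three window elements would be available in the same clock cycle rather than fetched in a loop; the Python loop only reproduces the values, not the timing.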
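  • 17. Simple example: y = x^2 + 30
      DFEVar x = io.input("x", dfeFloat(10,31));
      DFEVar result = x * x + 30;
      io.output("y", result, dfeFloat(10,31));
      [Figure: dataflow graph in which input x feeds a multiplier (x * x) followed by an adder (+ 30), producing output y.]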
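To make the streaming semantics of this kernel concrete, the following plain Python sketch models it as a generator: one output element for every input element. It is not MaxJ and does not model the `dfeFloat(10,31)` number format or the hardware pipeline; the name `simple_kernel` is illustrative.

```python
from typing import Iterable, Iterator


def simple_kernel(x_stream: Iterable[float]) -> Iterator[float]:
    """Software model of the kernel: y = x*x + 30 for each stream element."""
    for x in x_stream:
        yield x * x + 30


print(list(simple_kernel([1.0, 2.0, 3.0])))  # [31.0, 34.0, 39.0]
```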
  • 18. MaxJ example: Control in Space
      class SimpleKernel extends Kernel {
          SimpleKernel() {
              DFEVar x = io.input("x", dfeInt(24));
              DFEVar result = (x > 10) ? x + 1 : x - 1;
              io.output("y", result, dfeInt(25));
          }
      }
      [Figure: dataflow graph in which x feeds "+ 1", "- 1" and "> 10" side by side; the comparison selects which result becomes y.]
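A minimal Python sketch of the idea behind "control in space": both branches exist as hardware and are evaluated for every element, and the comparison only drives a selector. The function name `control_in_space` is hypothetical, and the sketch ignores the fixed bit widths of `dfeInt`.

```python
def control_in_space(x: int) -> int:
    """Model of the kernel: both branches are computed, a multiplexer picks one."""
    plus = x + 1    # adder, active for every element
    minus = x - 1   # subtractor, active for every element
    cond = x > 10   # comparator, active for every element
    # The multiplexer selects one of the two already-computed results.
    return plus if cond else minus


print([control_in_space(x) for x in (9, 10, 11)])  # [8, 9, 12]
```

Unlike a CPU branch, nothing is skipped here: the cost of the unused branch is paid in chip area rather than in time.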
  • 19. Non-Traditional Design Process. Used to build real systems; however, very difficult to learn and teach.
      [Figure: design loop with the stages ANALYSE, ARCHITECT, PROGRAM, GENERATE DATAFLOW, SIMULATE AND DEBUG, an “OK?” check, and, after many hours …, Custom HW.]
  • 20. Multiple scales of computing: optimizations at all levels
      Scale and important features for optimization:
      – complete system level: balance compute, storage and IO
      – parallel node level: maximize utilization of compute and interconnect
      – microarchitecture level: minimize data movement
      – arithmetic level: trade off range, precision and accuracy (= discretize in Time, Space and Value)
      – bit level: encode and add redundancy
      – transistor level: manipulate ‘0’ and ‘1’
      And more, e.g., trade Communication (Time) for Computation (Space), or hide it behind computation.
  • 21. Some of the challenges
      1. Higher chip / system price compared to microprocessors
      2. Long design lead times (3 months in the best case)
         a. Complex numerical transformations
         b. Non-trivial area and data movement optimizations
      3. “Painful” Place & Route times (12 hours to 24 hours)
         a. Expensive vendor-specific tools
         b. Serious developments ask for dedicated build clusters
      4. Need to compete at 200MHz with processors at 3GHz
      5. Current HW technology is sub-optimal
         • On-chip memory not built for stream processing
         • On-chip interconnect overdesigned for Dataflow
      6. Long learning curve (Tools and Methods needed)
      7. Designer’s productivity should improve (Tools and Methods)
      8. …
      Ongoing effort on improved methodologies and tools.
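Challenge 4 can be turned into a back-of-the-envelope number. The sketch below assumes, purely for illustration, a processor that completes one operation per clock cycle; real processors issue several operations per cycle, so the true gap is larger. The function name and its `cpu_ops_per_cycle` parameter are hypothetical.

```python
def parallel_ops_needed(cpu_hz: int, dfe_hz: int, cpu_ops_per_cycle: float = 1.0) -> float:
    """Operations per dataflow clock cycle needed to match the processor's throughput."""
    return cpu_hz * cpu_ops_per_cycle / dfe_hz


# 3 GHz processor versus a 200 MHz dataflow design, as on the slide.
print(parallel_ops_needed(3_000_000_000, 200_000_000))  # 15.0
```

Under this simplified assumption, a 200MHz design has to keep at least 15 operations busy in every cycle just to break even, which is why spatial designs rely on many operations running in parallel.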
  • 22. Multiple platforms, single DFE abstraction. Application and MaxJ + { gen4 (Intel based), gen5 (Xilinx based) }: performance-portable migration.
      [Figure: DFE block diagram: reconfigurable compute fabric with dataflow cores and FMEM (Fast Memory); high-bandwidth memory link to LMEM (Large Memory, 4-96GB); MaxRing links to the MaxRing interconnect; link to the main data network (e.g., PCIe, Infiniband).]
  • 23. Substrate Agnostic Compilation
      • MaxCompiler generates VHDL ready for FPGA vendor tools
      • Synthesis transforms VHDL into a logical “netlist”: sets of basic logic expressions
      • Map fits basic logic into N-input look-up tables (LUTs)
      • Place puts LUTs, DSPs, RAMs etc. at specific locations on the chip
      • Route sets up wiring between blocks
      [Figure: flow MaxCompiler compilation, Synthesis, Map, Place, Route, Generate Maxfile, with the intermediate products VHDL, Netlist, LUTs, Placed FPGA, Complete FPGA.]
  • 24. DFE Place and Route example: FPGA vendor-specific back-end tool flow, abstracted by MaxCompiler
      Mon 16:27: MaxCompiler version: 2012.2
      Mon 16:27: Build "MyKernel" start time: Mon Apr 08 16:27:24 BST 2013
      Mon 16:27: Main build process running as user training1 on host Maxworkstation7478
      Mon 16:27: Build location: /home/training1/maxcompiler-builds/MyKernel
      Mon 16:27: Instantiating manager
      Mon 16:27: Instantiating kernel "MyKernel"
      Mon 16:27: Compiling manager (CPU I/O Only)
      Mon 16:27: Compiling kernel "MyKernel"
      Mon 16:27: Generating input files (VHDL, netlists, CoreGen)
      Mon 16:27: Running back-end build (12 phases)
      Mon 16:27: (1/12) - Prepare MaxFile Data (GenerateMaxFileDataFile)
      Mon 16:27: (2/12) - Synthesize DFE Modules (XST)
      Mon 16:30: (3/12) - Link DFE Modules (NGCBuild)
      Mon 16:30: (4/12) - Prepare for Resource Analysis (EDIF2MxruBuildPass)
      Mon 16:30: (5/12) - Generate Preliminary Annotated Source Code
      Mon 16:30: (6/12) - Report Resource Usage (ResourceCounter)
      Mon 16:30: About to start chip vendor Map/Place/Route toolflow. This will take some time.
      Mon 16:30: (7/12) - Prepare for Placement (NGDBuild)
      Mon 16:30: (8/12) - Place and Route DFE (MPPR)
      Mon 16:30: Executing MPPR with 1 cost tables and 1 threads.
      Mon 16:30: MPPR: Starting 1 cost table
      Mon 16:43: MPPR: Cost table 1 met timing with score 0 (best score 0)
      Mon 16:43: (9/12) - Prepare for Resource Analysis (XDLBuild)
      Mon 16:44: (10/12) - Generate Resource Report (ResourceUsageBuildPass)
      Mon 16:44: (11/12) - Generate Annotated Source Code (ResourceAnnotationBuildPass)
      Mon 16:44: (12/12) - Generate MaxFile (GenerateMaxFile)
      Mon 16:45:
      Mon 16:45: FINAL RESOURCE USAGE
      Mon 16:45: LUTs: 9503 / 149760 (6.35%)
      Mon 16:45: FFs: 12749 / 149760 (8.51%)
      Mon 16:45: BRAMs: 34 / 516 (6.59%)
      Mon 16:45: DSPs: 0 / 1056 (0.00%)
      Mon 16:45:
      Mon 16:45: MaxFile: /home/training1/maxcompiler-builds/MyKernel/results/MyKernel.max (MD5Sum: e564cd922aeeda04acfa2f4ecce8236d)
      Mon 16:45: Build completed: Mon Apr 08 16:45:58 BST 2013 (took 18 mins, 33 secs)
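The percentages in the FINAL RESOURCE USAGE section are simply used / available. A short Python sketch that reproduces them from the counts in the log; the names `USAGE` and `utilization_percent` are illustrative and not part of any Maxeler tool.

```python
# (used, available) pairs copied from the FINAL RESOURCE USAGE lines of the log.
USAGE = {
    "LUTs": (9503, 149760),
    "FFs": (12749, 149760),
    "BRAMs": (34, 516),
    "DSPs": (0, 1056),
}


def utilization_percent(used: int, available: int) -> float:
    """Share of a resource type in use, in percent, rounded to two decimals."""
    return round(100.0 * used / available, 2)


for name, (used, available) in USAGE.items():
    print(f"{name}: {used} / {available} ({utilization_percent(used, available):.2f}%)")
```

The printed values agree with the log (6.35%, 8.51%, 6.59%, 0.00%), so this small design leaves most of the chip free.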
  • 25. DFE Resource Usage Reporting
      • Allows you to see what lines of code are using what resources and focus optimization
      • Separate reports for each kernel and for the manager
      LUTs     FFs      BRAMs    DSPs     : MyKernel.java
      727      871      1.0      2        : resources used by this file
      0.24%    0.15%    0.09%    0.10%    : % of available
      71.41%   61.82%   100.00%  100.00%  : % of total used
      94.29%   97.21%   100.00%  100.00%  : % of user resources
                                          :
                                          : public class MyKernel extends Kernel {
                                          :   public MyKernel (KernelParameters parameters) {
                                          :     super(parameters);
      1        31       0.0      0        :     DFEVar p = io.input("p", dfeFloat(8,24));
      2        9        0.0      0        :     DFEVar q = io.input("q", dfeUInt(8));
                                          :     DFEVar offset = io.scalarInput("offset", dfeUInt(8));
      8        8        0.0      0        :     DFEVar addr = offset + q;
      18       40       1.0      0        :     DFEVar v = mem.romMapped("table", addr,
                                          :       dfeFloat(8,24), 256);
      139      145      0.0      2        :     p = p * p;
      401      541      0.0      0        :     p = p + v;
                                          :     io.output("r", p, dfeFloat(8,24));
                                          :   }
                                          : }
      Different operations use different resources.
      [Figure: chip layout with DSP Blocks, Block RAMs, IO Blocks and LUT/FFs labelled.]
  • 26. Optimization Feedback
      • MaxCompiler gives detailed latency and area annotation back to the programmer
      • Evaluate precise effect of code on latency and chip area
      Example: 12.8 ns + 6.4 ns = 19.2 ns (total compute latency)
  • 27. Pilot System Deployed at Jülich (http://www.prace-ri.eu/pcp/)
      Small pilot system deployed in Oct 2017:
      • one 1U MPC-X with 8 MAX5 DFEs
      • one 1U AMD EPYC based server
      • one 1U login head node
      Scaling using Amazon AWS cloud:
      • MAX5 fully compatible with F1 instances
      • Elastic scaling between private and public
      [Figure: system diagram with remote users, head/build node, Supermicro EPYC node (EPYC CPU, 1TB DDR4) and MPC-X node (MAX5 DFEs); links labelled ipmi, 10 Gbps, 56 Gbps and 2x Infiniband @ 56 Gbps.]
  • 29. The BQCD Chip - AERIAL VIEW Scalable Conjugate Gradient Design for the CG step of BQCD
  • 30. Achieved Performance, PRACE workloads
      Problem  Size (Small/Large)      System (composition) (size)                       TTS [sec]  ETS [kWh]  DTS (F1)
      BQCD     32x32x32x32             PRACE pilot (8 DFEs, 64 EPYC cores) (2U)          1,054      0.44       -
      BQCD     64x64x64x64             1PF equivalent (48 DFEs, 512 EPYC cores) (14U)    1,703.8    4.26       $39.93
      NEMO     GYRE6                   PRACE pilot (8 DFEs, 64 EPYC cores) (2U)          388        0.164      -
      NEMO     GYRE144                 1PF equivalent (48 DFEs, 92 EPYC cores) (8U)      1,942      3.77       $42.72
      SFM3D    1 chunk x64x64          PRACE pilot (8 DFEs, 64 EPYC cores) (2U)          232        0.096      -
      SFM3D    6 chunks x1,440x1,440   1PF equivalent (384 DFEs, 768 EPYC cores) (60U)   5,150      70.1       $1,267.2
      QE       Al2O3                   PRACE pilot (8 DFEs, 64 EPYC cores) (2U)          32         0.013      -
      QE       Ta2O5                   1PF equivalent (64 DFEs, 64 EPYC cores) (9U)      3,210      7.58       $94.16
  • 31. Global Weather Simulation with DFEs in China
      ⬥ L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X. Huang, Y. Zhang, and G. Yang, Accelerating solvers for global atmospheric equations through mixed-precision data flow engine, published at FPL 2013
      ⬥ Joint research with Imperial College and Tsinghua University
      ⬥ Simulating the atmosphere using the shallow water equation
      Platform          Speedup   Efficiency
      6 Core CPU        1x        1x
      Tianhe-1A Node    23x       15x
      Maxeler MPC-X     330x      145x
      An order of magnitude improvement over the Linpack-driven supercomputer technology.
  • 32. Conclusions
      • A (fancy) name does not help with solving the problem at hand
      • Cloud, (Intelligent) Edge, Fog are just names like … Maria
      • FPGA is just a technology that can help bridge the gap to something better (Spatial Computing Acceleration HW, Quantum Processing, …)
      • just focus on building the best computer for the given job
      • Learn, think, pioneer and always stay critical
      • abstraction is powerful but quite often not needed
      • use it with great care and remember Dan Slotnick
      • We are turning Earth into a heterogeneous, planet-wide computer
      • so we should try not to kill it in the process
      • There is a lot of interest in this topic
  • 33. Questions? Contact me at: georgi@maxeler.com, or find me on Google: “Georgi computer” should do.
  • 34. Some links with more information
      Maxeler Multiscale Dataflow Computing: https://www.maxeler.com/technology/dataflow-computing/
      Computing in Space explained by Mike Flynn: http://www.openspl.org/what-is-openspl/
      Computing in Space Course at Imperial College: http://cc.doc.ic.ac.uk/openspl16/
      Exciting Applications for DFEs (and JDFEs): http://appgallery.maxeler.com
      Maxeler DFEs on AWS EC2 F1: https://aws.amazon.com/marketplace/seller-profile?id=2780c6ec-d326-47fc-9ff6-c66ab2ba202a
      Maxeler and Xilinx Alveo collaboration: https://www.xilinx.com/products/boards-and-kits/alveo.html
  • 35. Maxeler Applications Gallery: Dataflow Engine (DFE) Ecosystem
      ⬥ With over 150 universities in our university program, we decided to create an app gallery to enable the community to share applications, examples, demos, …
      ⬥ The App Gallery is complemented by a teaching program, with the first successful course taught at Imperial College in 2014; see http://cc.doc.ic.ac.uk/openspl14
      ⬥ Top 10 APPS:
        ➢ Correlation: in real-time, pairwise, on 6,000 streams
        ➢ 100% Guaranteed Packet Capture
        ➢ Webserver, cache and load balancing
        ➢ HESTON Option pricer
        ➢ N-body simulation
        ➢ Regex matching (e.g. for Security)
        ➢ Brain network simulation
        ➢ Quantum Chromo-Dynamics kernel
        ➢ Seismic Imaging
        ➢ Realtime Classification
      Dataflow Apps and Analytics for Machine Learning: http://appgallery.maxeler.com/
  • 36. Peer Reviewed Dataflow Publications
      2008: Seismic Imaging with Dataflow Engines, 25x faster. An Implementation of the Acoustic Wave Equation, T. Nemeth et al, Chevron, Society of Exploration Geophysicists, Nov 2008.
      2010: Credit Derivatives Valuation and Risk, from 8 hours to 2 minutes. American Finance Technology Award, with JP Morgan.
      2011: Modeling and Imaging with Schlumberger. Beyond Traditional Microprocessors for Geoscience High-Performance Computing Applications, O. Lindtjörn et al, Schlumberger, IEEE Micro, vol. 31, no. 2, March/April 2011.
      2012: Weather Imaging with CRS4, 60x faster. Acceleration of a Meteorological Limited Area Model with dataflow Engines, Diego Oriato†, Simon Tilbury†, Marino Marrocu§, Gabrielle Pusceddu§ (†Maxeler, §CRS4), 2012 Symposium on Application Accelerators in HPC.
      2013: Convergence of Risk and Trading in partnership with CME Group and birth of OpenSPL industry standard (www.openspl.org). In Cloud Computing it’s the Era of Convergence, Open Markets Magazine, Ari Studnitzer, CME Group.
      2014: Brain Simulation with Erasmus. Real-Time Olivary Neuron Simulations on Dataflow Computing Machines, Georgios Smaragdos, Craig Davies, Christos Strydis, Ioannis Sourdis, Catalin Ciobanu, Oskar Mencer, and Chris I. De Zeeuw, Supercomputing; Springer, 487-497.
      2017: High Energy Physics with Imperial. Using MaxCompiler for the high level synthesis of trigger algorithms, S. Summers, A. Rose and P. Sanders, Journal of Instrumentation, Volume 12, IOP Publishing.
  • 38. Maxeler Trophy Cabinet
      Academic History since 2005
      • Imperial College Research Excellence Award
      • Top EPSRC Advanced Fellowship
      • Two Best Paper Awards
      • Early Dataflow paper by Maxeler’s Founder has been recognized as one of the most influential papers at the FPL conference in the last 25 years.
      Recent Commercial Awards
      • HPCwire Editors Choice Award, November 2011.
      • American Finance Technology Awards, New York, winner, “Most Cutting Edge IT Initiative,” December 2011.
      • Golden Arrow, “...for revolutionizing Computers,” COM-SULT, January 2012.
      • Gartner “Cool Vendor of the Year,” March 2012.
      • Frost and Sullivan “Most innovative IT vendor,” Dec 2013.
      • CIO Review, 20 Most Promising Networking Companies, March 2014.
      • CIO Review, 20 Most Promising HPC Companies, March 2015.