3. 2016-10-19 Roberto Innocente inno@sissa.it 3
Table of Contents
1. Project history
2. What is an FPGA ?
3. INTEL/Altera Arria 10
4. 7 Dwarfs
5. Arithmetic Intensity (AI)
6. Roofline Model
7. CUDA/OpenCL
8. Actual Performance
9. OpenCL for FPGA
10. Getting the most
11. Schematics
12. HDL for FPGA
13. Spatial Computing (SC)
14. What next ?
15. Competitors
16. Can I use it ?
FPGA Computing project
● Project proposed at the beginning of 2014 :
– http://people.sissa.it/~inno/pubs/reconfig-computing-16-9-tris.pdf
● Nallatech board with Arria 10 FPGA ordered at the end of April (2016-04-22)
● Nallatech board arrived 2016-06-24
● Troubles with software licenses solved by mid-August (2016-08-14)
What is an FPGA ?
● FPGA (acronym of Field Programmable Gate Array) is a misnomer (gates in digital electronics are very simple circuits like: and, or, not, xor, ...)
● It is in fact an array of Configurable Logic Blocks (CLB : 6/7/8 inputs; the output can be any boolean function over them, or over 2/3 subsets of them)
● A "blank slate" in which you have to program both the functions that the Logic Blocks perform and the interconnections between them
● Today some of the LBs, to be more efficient, are specialized (Memory Blocks, DSP1 blocks, I/O blocks, ...)
1) DSP = Digital Signal Processor (Multiplier/Adder)
Scalar Product on an FPGA
[Figure: data-flow graph of x . y = Sum x[i]*y[i] for 4-element vectors — four multipliers x[i]*y[i] feed a tree of three adders.]
DFG = Data Flow Graph
While with other architectures you need to adapt your program to the architecture, with FPGA you adapt the architecture to your program.
Each cycle a new result (after 7 flops).
INTEL/Altera Arria 10
● This was the first FPGA on the market to offer native floating-point multiply/add in its DSPs.
● That's the reason why we bought it.
● Of course, on other large FPGAs too you can implement floating-point ops if you want, using the IP cores offered by vendors.
NB. IP core : a function implemented in schematics or an HDL, not free but proprietary (IP = Intellectual Property)
INTEL (Altera) Arria 10
● INTEL Arria 10 GX1150 :
– Logic Elements 1,150 K
– ALMs 427,200
– Registers 1,708,800
– M20K memory blocks 2,713
– DSPs 1,518 (integer and float SP)
● Back-of-the-envelope calculation : each DSP can output a Single-Precision Fused Multiply-Add per cycle
– 2 × 1518 = 3036 flops × 0.5 GHz ≈ 1500 Gflop/s
INTEL bought Altera in 2015 and has now started to re-brand everything. To avoid short-term obsolescence I will call it INTEL FPGA.
The “Seven dwarfs”
At the dawn of the many-core and heterogeneous new computer architectures, Phil Colella of LBL wrote the presentation Defining Software Requirements for Scientific Computing, in which he claimed that all new architectures should measure themselves against seven computational kernels common across every branch of scientific computing.
These computational kernels were later cozily named the Seven Dwarfs because, like in the Snow White fairy tale, they should be mining for gold in new Computer Architectures.
"A dwarf is an algorithmic method that captures a pattern of computation and communication."
http://view.eecs.berkeley.edu/wiki/Dwarf_Mine
The dwarfs grew with time to 13.
High-end simulation in the physical sciences consists of seven algorithms:
• Structured Grids (stencils, including locally structured grids, e.g. AMR)
• Unstructured Grids
• Fast Fourier Transform
• Dense Linear Algebra
• Sparse Linear Algebra
• Particles
• Monte Carlo
Phil Colella, 2004 (LBL)
The Roofline Model
Sam Williams, A. Waterman, D. Patterson (2009). "Roofline: An Insightful Visual Performance Model for Multicore Architectures" http://doi.acm.org/10.1145/1498765.1498785
● It is an intuitive visual performance model to provide estimates of performance
● Based on two ceilings :
– Peak flop performance of the architecture
– Maximum throughput of off-chip memory
Arria 10 Roofline Model
[Figure: log-log roofline plot of attainable Gflop/s vs arithmetic intensity (AI, flops/byte transferred). Limits are I/O bandwidth 6 GB/s and peak flops 1.5 Tflop/s, so the theoretical ceiling is (x < 250) ? 6*x : 1500, with the break-even point at AI = 250. Marked kernels: vector add (AI = 1/12), stencil 4-neigh (5/24), scalar product (1/4), SAXPY (3/8), vector magnitude (1/2).]
Data parallelism / Task parallelism
From www.fixstars.com
[Figure: data parallel vs task parallel. Data parallel: the same task runs on four slices of the data (0..¼N, ¼N..½N, ½N..¾N, ¾N..N), followed by a SUM. Task parallel: four different tasks run concurrently, followed by a SUM.]
The rise of the CUDA/OpenCL model
● In the middle of the past decade it became clear that Moore's law could be kept up only through parallelism. Many-core and heterogeneous computers appeared: GPUs, FPGAs, CPUs, DSPs
● GPUs with hundreds and then thousands of simple cores (forthcoming NVIDIA Pascal ~ 3,800 [available from 2017])
● Data parallelism can be supported with a simple model (differently from task parallelism) : a compute pattern (kernel) instantiated on every core with a different set of indices.
– a[i]+b[i] (Vector Addition kernel)
– Σk a[i,k] * b[k,j] (Matrix multiplication kernel)
– 1/2/3-dimensional NDRange / grid
● Each instantiation (work-item/thread) is provided with different parameters through a function call (e.g. get_global_id() ; in fact the core computes displacements by itself knowing its wg and wi numbers)
NVIDIA Pascal / GP100
● GP100 (device)
● SM (compute unit) : 2 × vector processors, 32-wide SIMD (because there is only 1 PC per warp)
OpenCL for FPGAs
● There is a compiler front end (UIUC LLVM) for the HDL Place&Route (PAR) package (in the INTEL/Altera case Quartus Pro)
● For the FPGAs the compilers are all offline compilers. Why ?
– It takes many hours or days of CPU to synthesize a complete project
– Forget about Apple/NVIDIA examples in which the OpenCL code is a string inside the host C++ program.
– INTEL/Altera say you need 32 GB of main memory, but in fact I have seen the compilation processes use 40/50 GB many times (so 64 GB is a better size).
● aoc : INTEL/Altera Offline Compiler :
– aoc krnl.cl -o krnl.aocx
FPGA/OpenCL
● OpenCL was born for different computer architectures and doesn't capture all the possibilities FPGAs can offer.
● Anyway, OpenCL for FPGA seems a mature product that offers a big step up in easily obtained FPGA performance.
Results Reported
● All the results reported here were obtained using the INTEL/Altera OpenCL compiler 16.0.0 Build 211 and the same version of Quartus Pro
● In a future report I will discuss Verilog results.
Vector Addition
● z[i] = x[i] + y[i]
● Computational intensity very low :
– AI = 1/12 (1 flop per 12 bytes transferred: two 4-byte loads and one 4-byte store)
● The limit then comes from I/O:
– 6 GB/s × 1/12 = 0.5 Gflop/s

./vector_add
Initializing OpenCL
Platform: Altera SDK for OpenCL
Performance on CPU 1 core of intel i7 :
Processing time on CPU = 1.1313ms
Mflops/s 883.948201
Launching for device 0 (1000000 elements)
Performance on FPGA :
Processing time on FPGA = 6.5348ms
Mflop/s on FPGA= 153.027972
Time: 6.535 ms
Kernel time (device 0): 3.668 ms
Stencil code
From the PDE, substitute the derivatives with discrete approximations (using a symbolic algebra package) to get a Difference Equation, and from that the stencil code: code that updates a point using the neighbor points' values.
[Figure: 3D stencil of order 8]
Stencil code/2
● 504³ ≈ 0.128 G points in the lattice
● 5 time steps :
– 0.128 × 5 = 0.640 G points processed
● 321 ms :
– 0.640/0.321 = 1.993 Gpoints/s processed
● 24 neighbors + 1 = 25 points × 2 ops =
– 50 ops per point
● 1.993 Gpoints/s × 50 flops =
– 99 Gflops/s on FPGA
● On a single core of an intel i7 cpu :
– 0.85 Gflop/s
● Arithmetic Intensity :
– AI = (2×25×N³) / (4×25×N³) = 1/2

$ ./stencil
Volume size: 504 x 504 x 504
order8 stencil computation for 5 time steps
Performance on FPGA :
Processing time : 321 ms
Throughput = 1.9897 Gpoints / sec
Gflops per second 99.486999
Performance on CPU intel i7 1 core :
Processing time on cpu = 37524.9531ms
Throughput on cpu = 0.0171 Gpoints / sec
Gflops per second on cpu 0.852926
Verifying data > PASSED

… but ..
Matrix Multiplication
● Matrix sizes:
– A: 2048 x 1024
– B: 1024 x 1024
– C: 2048 x 1024
● FPGA 128.77 Gflops
● CPU 1.48 Gflops (on 1 core of an Intel i7)

Generating input matrices
Launching for device 0 (global size: 1024, 2048)
Performance of FPGA :
Time: 33.353 ms
Kernel time (device 0): 33.294 ms
Throughput: 128.77 GFLOPS
Computing reference output
Performance of CPU Intel i7 single core :
Time: 2907.730 ms
Throughput: 1.48 GFLOPS

AI = N²×(2×N−1) / (4×3×N³) ≈ 1/6 (naively, 3 four-byte accesses per multiply-add)
… but ..
More efficient Matrix Multiplication
[Figure: matrices A and B, and a tile of C.]
Find 2 slices of A rows and B cols that you can keep in fast memory; then you can compute the corresponding tile of C without accessing any other data (data re-use due to caching). This can increase the Arithmetic Intensity a lot.
If you can store stripes k large, then you read B only once and A N/k times.
AI = (2N·k² × N²/k²) / (N² + (N/k)×N²) = 2N³ / (N²(1 + N/k)) = 2 / (1/k + 1/N) ≈ 2k
FFT 1d
● AI ~ 7.48
● FPGA ~ 120 Gflop/s
Fixed 4k points transform
Launching FFT transform for 2000 iterations
FFT kernel initialization is complete.
Processing time = 4.0878ms
Throughput = 2.0040 Gpoints / sec (120.2420 Gflops)
Signal to noise ratio on output sample: 137.677661
> PASSED
Launching inverse FFT transform for 2000 iterations
Inverse FFT kernel initialization is complete.
Processing time = 4.0669ms
Throughput = 2.0143 Gpoints / sec (120.8579 Gflops)
Signal to noise ratio on output sample: 137.041007
> PASSED
AI = 5*N*log2(N) / (4*2*N) = (5/8) * log2(N)
… but ..
FFT 2d
● ~ 66 Gflop/s
Launching FFT transform (alternative data layout)
Kernel initialization is complete.
Processing time = 1.5787ms
Throughput = 0.6642 Gpoints / sec (66.4201 Gflops)
Signal to noise ratio on output sample: 137.435876
> PASSED
Launching inverse FFT transform (alternative data
layout)
Kernel initialization is complete.
Processing time = 1.5781ms
Throughput = 0.6644 Gpoints / sec (66.4440 Gflops)
Signal to noise ratio on output sample: 136.689050
> PASSED
AI = 5*N²*log2(N²) / (4*2*N²) = (5*2*log2(N)) / (4*2) = (5/4) * log2(N)
Computing π with Monte Carlo
Computes π with a Mersenne twister rng.
Points = 2²²
GlobalWS=WG 32 , LocalWS=WI 32
I. 32x32 WI = 1024, each generating 4096 rn in [0,1]x[0,1] : 1024 × 4096 = 2²² = 4194304
II. For each batch of 4096 rn, computes ins and outs with respect to the circle
III. Computes the average
It takes ~ 854 ns for each rn

Using AOCX: mt.aocx
Reprogramming device with handle 1
Count all 26354932.000000 / ( rn 4096 * rng 32 *32) * 4
Computed pi = 3.141753
Mersenne twister : 1954.146849[ms]
Computing pi : 1632.340594[ms]
Copy results : 0.077611[ms]
Total time : 3586.565054[ms]
Sobel filter
● 1920 x 1080 pixels image, 3 x 8-bit color planes ~ 6 MB
● The filter can be applied at 140 fps

luma = (([R G B] · [66 129 25]ᵀ + 128) ≫ 8) + 16   (Rec BT.709)

Sobel Operators (applied by convolution) :
Sx = [ −1 0 +1 ; −2 0 +2 ; −1 0 +1 ]   Sy = [ −1 −2 −1 ; 0 0 0 ; +1 +2 +1 ]
∂I/∂x = I ∗ Sx ,  ∂I/∂y = I ∗ Sy
∇I = [ ∂I/∂x , ∂I/∂y ] ,  ‖∇I‖ = sqrt( (∂I/∂x)² + (∂I/∂y)² )
‖∇I(i,j)‖ < θ → pixel(i,j) = (0,0,0)
Other implementations
● Smith-Waterman
– Algorithm for computing the best match (with gaps and mismatches) between 2 DNA sequences
Status : in progress
● Spiking Neurons
– McCulloch-Pitts (and later Rosenblatt's perceptron) are too simple as models of neuron communication. In fact neurons for sure use spike frequency to signal strength of activation, or maybe even use spikes as a kind of binary code between them
Status : thought about it
OpenCL
https://www.khronos.org/files/opencl-1-1-quick-reference-card.pdf
Originally authored by Apple, tired of the need to support all the newly arriving computing devices (NVIDIA, AMD, Intel, ...). (2007/2008)
It goes mostly along the lines of its predecessor NVIDIA CUDA, but using a different terminology.
The rights were passed to a consortium that develops standards : Khronos. This consortium also develops the OpenGL standard (2008/2009).
OpenCL platform model
1 host + 1 or more compute devices
[Figure: a Host connected to Compute Devices; each Compute Device contains Compute Units, each made of Processing Elements (PE).]
OpenCL platform model
and FPGAs
FPGA :
● A Compute Device is an FPGA card (there can be many in a PC)
● A Compute Unit is a pipeline instantiated by the FPGA OpenCL compiler (you can implement multiple pipelines on the FPGA, as you will see in a later slide).
● A Processing Element (PE) is e.g. a DSP adder or multiplier in a pipeline.
NVIDIA CUDA :
● A Compute Device is an NVIDIA CUDA card
● A Compute Unit is a Streaming Multiprocessor (SM)
● A Processing Element (PE) is a CUDA core (on NVIDIA all cores in a warp execute the same instruction)
OpenCL / CUDA
Data Parallel Model
OpenCL :
● NDRange
● WorkGroup
● WorkItem
CUDA :
● Grid
● ThreadBlock
● Thread
The problem is represented as a computation carried out over a 1-, 2- or 3-dimensional array.
OpenCL
NDRange, work-group, work-item
From Intel https://software.intel.com/sites/landingpage/opencl/optimization-guide/Basic_Concepts.htm
[Figure: OpenCL NDRange ↔ CUDA grid, work-group ↔ CUDA threadblock, work-item ↔ CUDA thread.]
OpenCL attributes for FPGA
#define NUM_SIMD_WORK_ITEMS 4
#define REQD_WORK_GROUP_SIZE (64,1,1)
#define NUM_COMPUTE_UNITS 2
#define MAX_WORK_GROUP_SIZE 512
__kernel
__attribute__((max_work_group_size( MAX_WORK_GROUP_SIZE )))
__attribute__((reqd_work_group_size REQD_WORK_GROUP_SIZE ))
__attribute__((num_compute_units( NUM_COMPUTE_UNITS )))
__attribute__((num_simd_work_items( NUM_SIMD_WORK_ITEMS )))
void function(..) { ...; }
But ..
The compiler is mostly resource-driven and often doesn't obey your directives, despite what the docs promise.
Arria 10 killah kernel

Initializing OpenCL
Platform: Altera SDK for OpenCL
Using 1 device(s)
p385a_sch_ax115 : nalla_pcie (aclnalla_pcie0)
Using AOCX: loop.aocx
Reprogramming device with handle 1
Launching for device 0 (100000 elements)
Total runs 100000 , gflop 107374.182400
100,000 x 4 x (256*1024*1024)
Wall Time: 139909.012 ms
Gflop/s 767.457225
Kernel time (device 0): 139908.517 ms
Gflop/s 767.459945

[Figure: data-flow of the loop body: x[i] → (×2.0, +2.0, ×0.5, −1.0) → res]

#define N (256*1024*1024)
__kernel
void loop
(__global const float* x,
 __global float *restrict y)
{
    int i = get_global_id(0);
    float res = x[i];
    #pragma unroll 700
    for(int j=0;j<N;j++){
        res = res*2.0f + 2.0f;
        res = res*0.5f - 1.0f;
    }
    y[i] = res;
}
Arria 10 killah kernel – Quartus report
; SpectraQ Synthesis Resource Usage Summary for Partition "|"         ;
; Resource                                    ; Usage                 ;
; Estimate of Logic utilization (ALMs needed) ; 82174                 ;
; Combinational ALUT usage for logic          ; 102803                ;
;   7 input functions                         ; 5                     ;
;   6 input functions                         ; 1842                  ;
;   5 input functions                         ; 11104                 ;
;   4 input functions                         ; 18594                 ;
;   <=3 input functions                       ; 71258                 ;
; Dedicated logic registers                   ; 151334                ;
; I/O pins                                    ; 0                     ;
; Total MLAB memory bits                      ; 0                     ;
; Total block memory bits                     ; 1348604               ;
; Total DSP Blocks                            ; 1400                  ;
; Total Fixed Point DSP Blocks                ; 0                     ;
; Total Floating Point DSP Blocks             ; 1400                  ;
; Maximum fanout node                         ; clock_reset_clk       ;
; Maximum fanout                              ; 155035                ;
; Total fanout                                ; 692846                ;
; Average fanout                              ; 2.65                  ;

This is extremely good. It shows that the OpenCL compiler really created the same design an experienced hardware engineer would have created using Verilog. It used 2 DSPs in mul-and-add mode for each line of the loop, and it reached a performance of 50-60 % of the peak.
XI. Programming FPGAs with Schematics
Only very small projects can be handled using schematics
Quartus – FPGA using Schematics
1. Create a new project with the wizard (give a dir and a project name), select an empty project
2. Choose the FPGA model: 10AX115N3F40E2SG
3. Open New File, choose Design -> Schematic : a design whiteboard opens up
4. Choose all the components you need : i/o pins, dsp blocks (choose them in integer or fp mode from the IP catalog; a parameter editor will open up and you can program them to be adders/multipliers/fma)
5. Connect components with busses from the top menu
XII. Programming FPGAs with an HDL : Verilog
Again the Scalar Product of 2 vecs of length 4
Large projects can't be managed using Schematics : hundreds/thousands/tens of thousands of components, millions of interconnections, ...
top.v
module top( x0,y0,x1,y1,x2,y2,x3,y3,z,clk,ena,aclr);
input [31:0]x0; input [31:0]y0;
input [31:0]x1; input [31:0]y1;
input [31:0]x2; input [31:0]y2;
input [31:0]x3; input [31:0]y3;
output [31:0]z;
input clk; input ena; input [1:0]aclr;
wire [31:0]ir0; wire [31:0]ir1; wire [31:0]ir2; wire [31:0]ir3;
wire [31:0]ir4; wire [31:0]ir5;
dsp_fp_mul m1(.aclr(aclr),.ay(x0),.az(y0),.clk(clk),.ena(ena),.result(ir0));
dsp_fp_mul m2(.aclr(aclr),.ay(x1),.az(y1),.clk(clk),.ena(ena),.result(ir1));
dsp_fp_mul m3(.aclr(aclr),.ay(x2),.az(y2),.clk(clk),.ena(ena),.result(ir2));
dsp_fp_mul m4(.aclr(aclr),.ay(x3),.az(y3),.clk(clk),.ena(ena),.result(ir3));
dsp_fp_add a1(.aclr(aclr),.ax(ir0),.ay(ir1),.clk(clk),.ena(ena),.result(ir4));
dsp_fp_add a2(.aclr(aclr),.ax(ir2),.ay(ir3),.clk(clk),.ena(ena),.result(ir5));
dsp_fp_add a3(.aclr(aclr),.ax(ir4),.ay(ir5),.clk(clk),.ena(ena),.result(z));
endmodule
[Figure: module hierarchy — top.v instantiates dsp_fp_mul.v (m1..m4) and dsp_fp_add.v (a1..a3).]
In Verilog, what looks like a function call is in fact the instantiation of a circuit inside another. The parameter syntax expresses the correspondence (connection) of wires with wires.
dsp_fp_xxx
// dsp_fp_mul.v
// Generated using ACDS version 16.0 211
`timescale 1 ps / 1 ps
module dsp_fp_mul (
input wire [1:0] aclr, // aclr.aclr
input wire [31:0] ay, // ay.ay
input wire [31:0] az, // az.az
input wire clk, // clk.clk
input wire ena, // ena.ena
output wire [31:0] result // result.result
);
dsp_fp_mul_altera_fpdsp_block_160_ebvuera fpdsp_block_0 (
.clk (clk), // clk.clk
.ena (ena), // ena.ena
.aclr (aclr), // aclr.aclr
.result (result), // result.result
.ay (ay), // ay.ay
.az (az) // az.az
);
endmodule
// dsp_fp_add.v
`timescale 1 ps / 1 ps
module dsp_fp_add (ax,ay,result,clk,ena,aclr);
input wire [31:0]ax;
input wire [31:0]ay;
output wire [31:0]result;
input wire clk;
input wire ena;
input wire [1:0]aclr;
dsp_fp_add_altera_fpdsp_bloc_160_nmfrqti fdsp_block_0 (
.clk (clk),
.ena (ena),
.aclr (aclr),
.ax (ax), // ax.ax
.ay (ay), // ay.ay
.result (result) // result.result
);
endmodule
dsp_fp_mul.v / dsp_fp_add.v
These 2 modules are generated automatically when you instantiate a DSP from the IP cores in floating-point mode and configure it as an adder or a multiplier.
Quartus report on Scalar Product using HDL
Exactly the same as for the project using Schematics
OpenSPL
Open Spatial Programming Language
● Buzzword in the hands of a consortium led by Maxeler and Juniper on the industrial side, Stanford Uni, Imperial College, Tokyo Uni, .. on the academic side
● Everything kept as a trade secret for now
● Java interface ..
● IMHO this is a missed opportunity :
– "Spatial Programming" is probably the wrong name in these times, in which thousands of things around GPS, GEO, etc. are already called this way
– Plans and standards should be open, not kept as a secret for consortium members.
– The industrial members are weak in this market
– Java in this scene is, IMHO, not the right tool
– An open source movement should be started instead
My Proposal: json-graph-fpga
Use a simple and already existing format to describe the graph of components — JSON, for instance, or JSON-graph. (We assume all components are connected to a global clock.)

{
"inputs":["x0","x1","x2","x3","y0","y1","y2","y3"],
"x0":["m1"],"y0":["m1"],
"x1":["m2"],"y1":["m2"],
"x2":["m3"],"y2":["m3"],
"x3":["m4"],"y3":["m4"],
"m1":["a1"],"m2":["a1"],
"m3":["a2"],"m4":["a2"],
"a1":["a3"],"a2":["a3"],
"a3":["outputs"]
}

[Figure: the resulting graph — the inputs feed multipliers m1..m4, whose outputs feed adders a1 and a2, then a3, which drives the outputs.]
How to lift off-board b/w limitations ?
Directly to QPI or PCIe :
● Connect directly to the Intel QPI (Quick Path Interconnect) or the future Intel UPI (Ultra Path Interconnect), the processor/chipset point-to-point interconnect (60-80 GB/s). Already done with Xilinx chips
● Stratix 10 supports 4x PCIe Gen3 x16 ~ 60 GB/s
Stand-alone :
● Use FPGAs stand-alone. The Stratix 10 supports DDR4 memory or HMC (Hybrid Memory Cube). Connections with Interlaken channels support 14.7 Gb/s per lane.
Can I use it ?
– I'm interested in making comparison tests with Tesla and other architectures
– I'm interested in trying kernels with sufficient Arithmetic Intensity to run efficiently
– I'm interested in interesting problems :)
– The limit is that there is only 1 board on 1 PC and the compiler license is for 1 seat.
About this please write to me !
"and go on till you come to the end: then stop.”
Lewis Carrol
but I think also Jacques De La Palice (or de La Palisse) could have said something like that