2016-10-19 Roberto Innocente inno@sissa.it 1
FPGA computing
@ SISSA
Roberto Innocente
inno@sissa.it
"Begin at the beginning ..."
Lewis Carroll, Alice in Wonderland
Table of Contents
1. Project history
2. What is an FPGA ?
3. INTEL/Altera Arria 10
4. 7 Dwarfs
5. Arithmetic Intensity(AI)
6. Roofline Model
7. CUDA/OpenCL
8. Actual Performance
9. OpenCL for FPGA
10. Getting the most
11. Schematics
12. HDL for FPGA
13. Spatial Computing(SC)
14. What next ?
15. Competitors
16. Can I use it ?
FPGA Computing project
● Project proposed at the beginning of 2014 :
– http://people.sissa.it/~inno/pubs/reconfig-computing-16-9-tris.pdf
● Nallatech board with Arria 10 FPGA ordered at the end of April (2016-04-22)
● Nallatech board arrived on 2016-06-24
● Problems with software licenses solved in mid-August (2016-08-14)
II. What is an FPGA ?
What is an FPGA ?
● FPGA (Field Programmable Gate Array) is a misnomer: gates in digital electronics are very simple circuits such as and, or, not, xor, ...
● It is in fact an array of Configurable Logic Blocks (CLBs) with 6/7/8 inputs each; the output can be any boolean function of the inputs, or of 2/3 subsets of them
● It is a "blank slate" in which you have to program both the functions that the Logic Blocks perform and the interconnections between them
● Today some of the blocks are specialized for efficiency (memory blocks, DSP blocks, I/O blocks, ...). Here DSP = Digital Signal Processor (multiplier/adder)
Array of Configurable Logic Blocks
Picture from National Instruments
Scalar Product on an FPGA
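[Figure: Data Flow Graph (DFG) of the scalar product x . y = Sum x[i]*y[i] for 4-element vectors: four multipliers x[i]*y[i] (i = 0..3) feed two adders, whose outputs feed a final adder]
While with other architectures you need to adapt your program to the architecture, with an FPGA you adapt the architecture to your program.
Each cycle the pipeline delivers a new result, each one costing 7 flops (4 multiplications + 3 additions).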
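As a rough software analogy of the data flow graph, the sketch below first evaluates the scalar product level by level and then mimics a pipelined circuit that accepts one pair of vectors per clock cycle. The function names and the assumption of exactly one clock cycle per level of the graph are illustrative only; real DSP blocks have their own latencies.

```python
def scalar_product_dfg(x, y):
    """Evaluate x . y for 4-element vectors following the levels of the DFG."""
    products = [a * b for a, b in zip(x, y)]                           # level 1: 4 multipliers
    partial = (products[0] + products[1], products[2] + products[3])  # level 2: 2 adders
    return partial[0] + partial[1]                                     # level 3: 1 adder


def pipelined_scalar_product(pairs):
    """Feed one (x, y) pair per cycle through a 3-stage pipeline.

    Assumption: each level of the graph takes exactly one clock cycle.
    """
    stage1 = None  # register holding the 4 products
    stage2 = None  # register holding the 2 partial sums
    results = []
    for pair in list(pairs) + [None, None]:  # two extra cycles drain the pipeline
        if stage2 is not None:
            results.append(stage2[0] + stage2[1])
        stage2 = None if stage1 is None else (stage1[0] + stage1[1], stage1[2] + stage1[3])
        stage1 = None if pair is None else [a * b for a, b in zip(*pair)]
    return results


if __name__ == "__main__":
    stream = [([1, 2, 3, 4], [5, 6, 7, 8]), ([1, 0, 0, 0], [2, 0, 0, 0])]
    print(pipelined_scalar_product(stream))
```

Once the pipeline is full, every additional cycle yields one more result, which is the point of laying the 7 operations out in space instead of executing them one after the other.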
III. The FPGA of our tests
INTEL/Altera Arria 10
● This was the first FPGA on the market to offer
native floating point multiply/add in its DSPs.
● That's the reason why we bought it.
● Of course you can implement floating point operations on the other large FPGAs too, using the IP cores offered by vendors.
NB. IP core : a function implemented in schematics or in an HDL that is not free but proprietary (IP = Intellectual Property)
INTEL (Altera) Arria 10
● INTEL Arria 10 GX1150 :
– Logic Elements 1,150 K
– ALMs 427,200
– Registers 1,708,800
– M20K mem block 2,713
– DSP 1,518 (integer and float SP)
● Back of the envelope calculation : each DSP can output one single precision fused multiply-add (2 flops) per cycle
– 2 × 1518 = 3036 flop/cycle × 0.5 GHz ≈ 1500 Gflop/s
INTEL bought Altera in 2015 and is now re-branding everything. To avoid short-term obsolescence I will call it INTEL FPGA.
[Figure from INTEL/Altera docs]
IV. How to measure performance
of new architectures ?
The “Seven dwarfs”
At the dawn of the many-core and heterogeneous computer architectures, Phil Colella of LBL wrote the presentation "Defining Software Requirements for Scientific Computing", in which he claimed that all new architectures should be measured against seven computational kernels common across every branch of scientific computing.
These computational kernels were later nicknamed the Seven Dwarfs because, as in the Snow White fairy tale, they should be mining for gold in new computer architectures.
"A dwarf is an algorithmic method that captures a pattern of computation and communication."
http://view.eecs.berkeley.edu/wiki/Dwarf_Mine
The dwarfs grew with time to 13.
High-end simulation in the physical sciences consists of seven algorithms:
• Structured Grids (stencils, including locally structured grids, e.g. AMR)
• Unstructured Grids
• Fast Fourier Transform
• Dense Linear Algebra
• Sparse Linear Algebra
• Particles
• Monte Carlo
Phil Colella,2004 (LBL)
V. Arithmetic Intensity AI
Arithmetic/Computational Intensity (AI)
AI = FLOPs / bytes transferred from/to off-chip memory

Kernel                Formula                                         AI single precision   AI double precision
Vector addition       z[i] = x[i] + y[i]                              1/12 = 0.083          1/24
Scalar product        Σ a[i] * b[i]                                   1/4  = 0.250          1/8
Vector magnitude      Σ a[i] * a[i]                                   1/2  = 0.500          1/4
SAXPY                 α∗x + y                                         1/6  = 0.167          1/12
Stencil 4 neighbors   C[i,j] = a*A[i,j] + b*(A[i-1,j]+A[i+1,j]...)    5/24 = 0.208          5/48
Matrix Multiply       C[i,j] = Σk A[i,k] * B[k,j]                     1/4  = 0.250          1/8
FFT1d                                                                 0.9*log(N) = 7.48 for N=4096
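The simplest entries of the table can be re-derived by counting flops and memory words per output element. The sketch below does this with exact fractions; the per-element counts are my own reading of each kernel (for instance, SAXPY reads x and y and writes y, i.e. 3 words for 2 flops), and the helper name is hypothetical.

```python
from fractions import Fraction


def arithmetic_intensity(flops, words_moved, bytes_per_word=4):
    """AI = flops / bytes moved to or from off-chip memory."""
    return Fraction(flops, words_moved * bytes_per_word)


# (flops per element, words read + written per element): assumed counts
KERNELS = {
    "vector addition": (1, 3),   # 1 add; read x[i], y[i]; write z[i]
    "scalar product": (2, 2),    # 1 mul + 1 add; read a[i], b[i]
    "vector magnitude": (2, 1),  # 1 mul + 1 add; read a[i]
    "SAXPY": (2, 3),             # 1 mul + 1 add; read x[i], y[i]; write y[i]
}

if __name__ == "__main__":
    for name, (flops, words) in KERNELS.items():
        single = arithmetic_intensity(flops, words, bytes_per_word=4)
        double = arithmetic_intensity(flops, words, bytes_per_word=8)
        print(f"{name:18s} SP {single}  DP {double}")
```

Doubling the word size halves the intensity, which is why the double precision column is always half of the single precision one.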
AI Arithmetic Intensity
From https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
VI. The Roofline Model (RM)
The Roofline Model
S. Williams, A. Waterman, D. Patterson (2009-04-01). "Roofline: An Insightful Visual Performance Model for Multicore Architectures"
http://doi.acm.org/10.1145/1498765.1498785
● It is an intuitive visual model that provides performance estimates
● Based on two ceilings :
– Peak flop performance of the architecture
– Maximum throughput of off-chip memory
Arria10 Roofline Model
[Figure: roofline model, log-log plot. x axis: arithmetic intensity (AI), flops/byte transferred, from 0.01 to 1000; y axis: attainable Gflop/s, from 0.01 to 1000. Limits are I/O bandwidth 6 GB/s and peak 1.5 Tflop/s; the break-even point is at AI = 250. Kernels marked on the plot: vector add (1/12), stencil 4 neighbors (5/24), scalar product (1/4), SAXPY (3/8), vector magnitude (1/2)]
Theoretical peak: (x < 250) ? 6*x : 1500
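The two ceilings combine into a single formula: attainable performance is the smaller of the compute peak and bandwidth times arithmetic intensity. A minimal sketch with the figures quoted on this slide (6 GB/s, 1.5 Tflop/s) as default parameters; the function names are mine.

```python
def attainable_gflops(ai, peak_gflops=1500.0, bandwidth_gb_s=6.0):
    """Roofline bound: min(compute ceiling, memory ceiling) for a given AI in flops/byte."""
    return min(peak_gflops, bandwidth_gb_s * ai)


def ridge_point(peak_gflops=1500.0, bandwidth_gb_s=6.0):
    """Arithmetic intensity at which the two ceilings meet (the break-even point)."""
    return peak_gflops / bandwidth_gb_s


if __name__ == "__main__":
    for name, ai in [("vector add", 1 / 12), ("scalar product", 1 / 4),
                     ("vector magnitude", 1 / 2), ("break-even", ridge_point())]:
        print(f"{name:17s} AI = {ai:8.3f}  ->  {attainable_gflops(ai):8.2f} Gflop/s")
```

With these parameters every kernel below AI = 250 is limited by the board's I/O bandwidth rather than by the DSPs.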
VII. CUDA/OpenCL
Data parallelism / Task parallelism
[Figure, from www.fixtars.com: data parallel execution (the same task applied to the index ranges 0..¼N, ¼N..½N, ½N..¾N, ¾N..N, followed by SUM) compared with task parallel execution (several different tasks, followed by SUM)]
The rise of the CUDA/OpenCL model
● In the middle of the past decade it became clear that Moore's law could be sustained only through parallelism. Many-core and heterogeneous computers appeared: GPUs, FPGAs, CPUs, DSPs
● GPUs came with hundreds and then thousands of simple cores (the forthcoming NVIDIA Pascal has ~ 3,800 [available from 2017])
● Data parallelism can be supported with a simple model (unlike task parallelism) : a compute pattern (kernel) instantiated on every core with a different set of indices.
– a[i]+b[i] (Vector Addition kernel)
– Σk a[i,k] * b[k,j] (Matrix multiplication kernel)
– 1/2/3 dimensional NDRange / grid
● Each instantiation (work-item/thread) is provided with different parameters through a function call (e.g. get_global_id(); in fact the core computes the displacements by itself, knowing its work-group and work-item numbers)
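The kernel-per-index idea can be mimicked on a CPU by calling the same function once for every point of the index space. This is only a serial emulation under my own naming (run_ndrange is hypothetical and global_id stands in for get_global_id()); on a real device the instances run in parallel.

```python
from itertools import product


def run_ndrange(kernel, global_size, *args):
    """Call `kernel` once per work-item; global_size is a tuple of 1, 2 or 3 extents."""
    for global_id in product(*(range(n) for n in global_size)):
        kernel(global_id, *args)


def vecadd(global_id, a, b, c):
    i = global_id[0]                      # plays the role of get_global_id(0)
    c[i] = a[i] + b[i]


def matmul(global_id, a, b, c, n):
    i, j = global_id                      # get_global_id(0), get_global_id(1)
    c[i][j] = sum(a[i][k] * b[k][j] for k in range(n))


if __name__ == "__main__":
    z = [0] * 3
    run_ndrange(vecadd, (3,), [1, 2, 3], [10, 20, 30], z)
    print(z)
```

Each kernel body contains no loop over the output elements: the loop is replaced by the index space itself.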
NVIDIA pascal / GP100
● GP100 (device)
● SM (compute unit) : 2 x vector processors, 32 SIMD (because there is only 1 PC per warp)
OpenCL for FPGAs
● There is a compiler front end (UIUC LLVM) for the HDL Place & Route (PAR) package (Quartus Pro in the INTEL/Altera case)
● For FPGAs the compilers are all offline compilers. Why ?
– It takes many hours or days of CPU time to synthesize a complete project
– Forget about the Apple/NVIDIA examples in which the OpenCL code is a string inside the host C++ program
– INTEL/Altera say you need 32 GB of main memory, but I have seen the compilation processes use 40/50 GB many times (so 64 GB is a better size)
● aoc : INTEL/Altera Offline Compiler :
– aoc krnl.cl -o krnl.aocx
[Figure: the two compilation paths. Host code path: host source code (.c or .cpp) → host compiler → host binary. Kernel code path: kernel source code (.cl) → AOC (INTEL/Altera Offline Compiler) → FPGA binary (.aocx). The host application is executed on the host.]
FPGA/OpenCL
● OpenCL was born for different computer architectures and doesn't capture all the possibilities FPGAs can offer.
● Anyway OpenCL for FPGA seems a mature product that makes good FPGA performance much easier to obtain.
VIII. Actual performance
Results Reported
● All the results reported here were obtained using the INTEL/Altera OpenCL compiler 16.0.0 Build 211 and the same version of Quartus Pro
● In a future report I will discuss Verilog results.
Vector Addition
● z[i] = x[i] + y[i]
● Computational intensity very low : AI = 1/12
● The limit then comes from I/O :
– 6 GB/s × 1/12 flop/byte = 0.5 Gflop/s
./vector_add
Initializing OpenCL
Platform: Altera SDK for OpenCL
Performance on CPU 1 core of intel i7 :
   Processing time on CPU   = 1.1313ms
   Mflops/s 883.948201
Launching for device 0 (1000000 elements)
Performance on FPGA :
   Processing time on FPGA  = 6.5348ms
   Mflop/s on FPGA= 153.027972
   Time: 6.535 ms
   Kernel time (device 0): 3.668 ms
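The reported rates follow directly from the element count and the measured times, since vector addition performs one flop per element. A small check under that assumption (function names are mine; the tiny differences from the log come from the rounded times it prints):

```python
def mflops(flop_count, time_ms):
    """Throughput in Mflop/s for flop_count operations completed in time_ms milliseconds."""
    return flop_count / (time_ms * 1e-3) / 1e6


def fraction_of_io_bound(measured_mflops, ai=1 / 12, bandwidth_gb_s=6.0):
    """Measured rate relative to the I/O limit bandwidth * AI (expressed in Mflop/s)."""
    return measured_mflops / (bandwidth_gb_s * ai * 1000.0)


if __name__ == "__main__":
    n = 1_000_000                       # elements, one addition each
    fpga = mflops(n, 6.5348)
    cpu = mflops(n, 1.1313)
    print(f"FPGA {fpga:.1f} Mflop/s, CPU {cpu:.1f} Mflop/s, "
          f"FPGA at {fraction_of_io_bound(fpga):.0%} of the I/O bound")
```

For a kernel this poor in arithmetic intensity the FPGA cannot beat even a single CPU core, because the data has to cross the board's I/O link first.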
Stencil code
[Figure: from PDE to stencil code. PDE → substitute derivatives with discrete approximations using a symbolic algebra package → difference equation → stencil code]
Stencil code: code that updates a point using the neighbor point values.
[Figure: 3D stencil of order 8]
Stencil code/2
● 504^3 = 0.128 G points in lattice
● 5 time steps :
– 0.128 * 5 = 0.640 G points processed
● 321 ms :
– 0.640/0.321 = 1.993 Gpoints/s processed
● 24 neighbors + 1 = 25 points * 2 ops =
– 50 ops
● 1.993 Gpoints/s * 50 flops =
– 99 Gflop/s on FPGA
● On a single core of intel i7 cpu :
– 0.85 Gflop/s
● Arithmetic Intensity :
– $\mathrm{AI} = \frac{2 \times 25 \times N^3}{4 \times 25 \times N^3} = \frac{1}{2}$ … but ..

$ ./stencil
Volume size: 504 x 504 x 504
order-8 stencil computation for 5 time steps
Performance on FPGA :
      Processing time : 321 ms
      Throughput = 1.9897 Gpoints / sec
      Gflops per second  99.486999
Performance on CPU intel i7 1 core :
      Processing time on cpu = 37524.9531ms
      Throughput on cpu = 0.0171 Gpoints / sec
      Gflops per second  on cpu 0.852926
      Verifying data --> PASSED
Matrix Multiplication
● Matrix sizes:
– A: 2048 x 1024
– B: 1024 x 1024
– C: 2048 x 1024
● FPGA 128.77 Gflops
● CPU 1.48 Gflops (on
1 core of Intel i7)
Generating input matrices
Launching for device 0 (global size: 1024, 2048)
Performance of FPGA :
   Time: 33.353 ms
   Kernel time (device 0): 33.294 ms
   Throughput: 128.77 GFLOPS
Computing reference output
Performance of CPU Intel i7 single core :
   Time: 2907.730 ms
   Throughput: 1.48 GFLOPS
$\mathrm{AI} = \frac{N^2 \times (2N - 1)}{4 \times 3 \times N^2} = \frac{2N - 1}{12} \approx \frac{N}{6}$ … but ..
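More efficient Matrix Multiplication
[Figure: matrices A and B and the tile of C computed from a slice of A rows and a slice of B columns]
Find 2 slices, of A rows and of B columns, that you can keep in fast memory: then you can compute the corresponding tile of C without accessing any other data (data re-use due to caching). This can greatly increase the Arithmetic Intensity.
If you can store stripes of width k, then you read B only once and A N/k times. Counting flops per matrix element transferred:
$\mathrm{AI} = \frac{2N \cdot k^2 \cdot \frac{N^2}{k^2}}{N^2 + \frac{N}{k} \times N^2} = \frac{2N^3}{N^2 \left(1 + \frac{N}{k}\right)} = \frac{2}{\frac{1}{k} + \frac{1}{N}} \approx 2k$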
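The traffic count behind this formula can be reproduced by walking over the stripes and adding up what each pass reads. A minimal sketch, assuming square N x N matrices, stripes of k columns of B held in fast memory, and A streamed once per stripe; the function names are hypothetical and writes of C are ignored, as in the formula.

```python
def striped_traffic(n, k):
    """Matrix elements read from slow memory when B is processed in stripes of width k."""
    reads = 0
    for _ in range(0, n, k):   # one pass per stripe of B
        reads += n * k         # the stripe of B itself, read once
        reads += n * n         # all of A streamed past the stripe
    return reads               # = n^2 + (n/k) * n^2 when k divides n


def striped_intensity(n, k):
    """Flops per element read: 2 n^3 flops over the traffic above."""
    return 2 * n ** 3 / striped_traffic(n, k)


if __name__ == "__main__":
    n = 1024
    for k in (1, 8, 32, 128):
        print(f"k = {k:4d}: {striped_intensity(n, k):7.2f} flops/element (2k = {2 * k})")
```

The intensity stays just below 2k and approaches it as N grows relative to k, so the stripe width that fits in on-chip memory sets how far the kernel moves to the right on the roofline plot.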
Matrix Multiplication/2
+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;   42%                     ;
; ALUTs                                  ;   17%                     ;
; Dedicated logic registers              ;   25%                     ;
; Memory blocks                          ;   40%                     ;
; DSP blocks                             ;   31%                     ;
+----------------------------------------+---------------------------;
Matrix Multiplication/3
+---------------------------------------------+-----------------+
; Resource                                    ; Usage           ;
+---------------------------------------------+-----------------+
; Estimate of Logic utilization (ALMs needed) ; 64255           ;
;                                             ;                 ;
; Combinational ALUT usage for logic          ; 49717           ;
;     -- 7 input functions                    ; 32              ;
;     -- 6 input functions                    ; 12400           ;
;     -- 5 input functions                    ; 1882            ;
;     -- 4 input functions                    ; 5526            ;
;     -- <=3 input functions                  ; 29877           ;
;                                             ;                 ;
; Dedicated logic registers                   ; 122269          ;
;                                             ;                 ;
; I/O pins                                    ; 0               ;
; Total MLAB memory bits                      ; 0               ;
; Total block memory bits                     ; 5220584         ;
;                                             ;                 ;
; Total DSP Blocks                            ; 392             ;
;     -- Total Fixed Point DSP Blocks         ; 8               ;
;     -- Total Floating Point DSP Blocks      ; 384             ;
;                                             ;                 ;
; Maximum fan-out node                        ; clock_reset_clk ;
; Maximum fan-out                             ; 147565          ;
; Total fan-out                               ; 914768          ;
; Average fan-out                             ; 4.56            ;
+---------------------------------------------+-----------------+
FFT 1d
● AI ~ 7.48
● FPGA ~ 120 Gflop/s
Fixed 4k points transform
Launching FFT transform for 2000 iterations
FFT kernel initialization is complete.
Processing time = 4.0878ms
Throughput = 2.0040 Gpoints / sec (120.2420 Gflops)
Signal to noise ratio on output sample: 137.677661
--> PASSED
Launching inverse FFT transform for 2000 iterations
Inverse FFT kernel initialization is complete.
Processing time = 4.0669ms
Throughput = 2.0143 Gpoints / sec (120.8579 Gflops)
Signal to noise ratio on output sample: 137.041007
--> PASSED
$\mathrm{AI} = \frac{5 N \log(N) / \log(2)}{4 \times 2 \times N} = \frac{5}{8} \cdot \frac{\log(N)}{\log(2)}$ … but ..
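The formula uses the conventional estimate of 5 N log2(N) flops for an N-point complex FFT and 4 × 2 bytes per single precision complex point. A short sketch evaluating it for the 4k-point transform; the function names are mine. It gives exactly 7.5, and the 7.48 quoted on the slides follows from rounding 5/(8 ln 2) to 0.9.

```python
import math


def fft_flops(n):
    """Conventional flop estimate for an n-point complex FFT: 5 n log2(n)."""
    return 5 * n * math.log2(n)


def fft_intensity(n, bytes_per_point=4 * 2):
    """AI with each single precision complex point (2 x 4 bytes) counted once."""
    return fft_flops(n) / (bytes_per_point * n)


if __name__ == "__main__":
    n = 4096
    flops_per_point = fft_flops(n) / n           # 5 * log2(4096) = 60
    print(f"AI = {fft_intensity(n)}")
    print(f"2.0040 Gpoints/s x {flops_per_point:.0f} flops = {2.0040 * flops_per_point:.2f} Gflop/s")
```

The second line reproduces the conversion used in the program output: Gpoints/s times 60 flops per point.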
FFT 2d
● ~ 66 Gflop/s
Launching FFT transform (alternative data layout)
Kernel initialization is complete.
Processing time = 1.5787ms
Throughput = 0.6642 Gpoints / sec (66.4201 Gflops)
Signal to noise ratio on output sample: 137.435876
--> PASSED
Launching inverse FFT transform (alternative data layout)
Kernel initialization is complete.
Processing time = 1.5781ms
Throughput = 0.6644 Gpoints / sec (66.4440 Gflops)
Signal to noise ratio on output sample: 136.689050
--> PASSED
$\mathrm{AI} = \frac{5 N^2 \log(N^2)}{4 \times 2 \times N^2 \times \log(2)} = \frac{5 \times 2 \times \log(N)}{4 \times 2 \times \log(2)} = \frac{5}{4} \cdot \frac{\log(N)}{\log(2)}$
Computing π with Montecarlo
Computes π by Monte Carlo using a Mersenne twister random number generator (rng).
Points = 2^22
GlobalWS = WG 32, LocalWS = WI 32
I. 32x32 WI = 1024 work-items each generate 4096 random numbers (rn) in [0,1]x[0,1] : 2^22 = 4194304 in total
II. For each batch of 4096 rn, count the points falling inside and outside the circle
III. Compute the average
It takes ~ 854 ns for each rn
Using AOCX: mt.aocx
Reprogramming device with handle 1
Count all 26354932.000000 / ( rn 
4096 * rng 32 *32) * 4
Computed pi  = 3.141753
 Mersenne twister : 1954.146849[ms]
 Computing     pi : 1632.340594[ms]
 Copy results     :    0.077611[ms]
 Total time       : 3586.565054[ms]
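The same estimator fits in a few lines on a CPU. This is a minimal serial sketch and not the FPGA kernel: it relies on Python's random module (which is itself based on the Mersenne Twister), and the function name, seed and point count are arbitrary choices of mine.

```python
import random


def estimate_pi(n_points, seed=0):
    """Estimate pi from the fraction of random points of the unit square inside the quarter circle."""
    rng = random.Random(seed)            # Mersenne Twister generator
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_points


if __name__ == "__main__":
    print(estimate_pi(200_000))
```

The statistical error shrinks only as 1/sqrt(n), which is why the kernel needs millions of points for a few correct digits.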
Sobel filter
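● 1920 x 1080 pixel image, 3 x 8 bit color planes, ~ 6 MB
● The filter can be applied at 140 fps
● $\mathrm{luma} = \left(\left(\begin{bmatrix} R & G & B \end{bmatrix} \begin{bmatrix} 66 \\ 129 \\ 25 \end{bmatrix} + 128\right) \gg 8\right) + 16$ (Rec BT709)
Sobel operators:
$S_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad S_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix}$
$\frac{\partial I}{\partial x} = I * S_x, \quad \frac{\partial I}{\partial y} = I * S_y$ (where $*$ is the convolution)
$\nabla I = \begin{bmatrix} \frac{\partial I}{\partial x} & \frac{\partial I}{\partial y} \end{bmatrix}, \quad \|\nabla I\| = \sqrt{\left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2}$
$\|\nabla I(i,j)\| < \theta \rightarrow \mathrm{pixel}(i,j) = (0,0,0)$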
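The filter can be written out directly from these formulas. The sketch below is a pure-Python illustration on a tiny grayscale image, not the FPGA kernel: border pixels are left black, pixels above the threshold keep the clipped gradient magnitude (my assumption, the slide only says what happens below the threshold), and the 3x3 operators are applied without flipping, which changes only the sign of the derivatives and not the magnitude.

```python
import math

SX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def luma(r, g, b):
    """Integer luma from 8-bit R, G, B with the coefficients given on the slide."""
    return ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16


def sobel(img, theta):
    """Gradient magnitude of a 2D list of luma values; values below theta become 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = sum(SX[a][b] * img[i + a - 1][j + b - 1] for a in range(3) for b in range(3))
            gy = sum(SY[a][b] * img[i + a - 1][j + b - 1] for a in range(3) for b in range(3))
            mag = math.sqrt(gx * gx + gy * gy)
            out[i][j] = 0 if mag < theta else min(255, int(mag))
    return out


if __name__ == "__main__":
    edge = [[0, 0, 255, 255] for _ in range(4)]   # vertical edge in the middle
    for row in sobel(edge, theta=100):
        print(row)
```

Each output pixel needs only its 3x3 neighborhood, which is what makes the filter a good fit for a streaming pipeline with a few line buffers.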
Other implementations
● Smith-Waterman
– Algorithm for computing the best match (with gaps and mismatches) between 2 DNA sequences
Status : in progress
● Spiking Neurons
– McCulloch-Pitts neurons (and the later Rosenblatt perceptron) are too simple as models of neuron communication. Real neurons certainly use spike frequency to signal strength of activation, and may even use spikes as a kind of binary code between them
Status : thought about it
IX. More on OpenCL for FPGA
OpenCL
https://www.khronos.org/files/opencl-1-1-quick-reference-card.pdf
Originally authored by Apple (2007/2008), which was tired of having to support all the newly arriving computing devices (NVIDIA, AMD, Intel, ...).
It mostly follows the lines of its predecessor NVIDIA CUDA, but uses a different terminology.
The rights were passed to Khronos, a consortium that develops standards.
This consortium also develops the OpenGL standard (2008/2009).
OpenCL platform model
1 host + 1 or more compute devices
[Figure: a host connected to compute devices; each compute device contains compute units, and each compute unit contains Processing Elements (PE)]
OpenCL platform model
and FPGAs
FPGA :
● A Compute Device is an FPGA card (there can be many in a PC)
● A Compute Unit is a pipeline instantiated by the FPGA OpenCL compiler (you can implement multiple pipelines on the FPGA, as a later slide shows)
● A Processing Element (PE) is e.g. a DSP adder or multiplier in a pipeline
NVIDIA CUDA :
● A Compute Device is an
NVIDIA CUDA card
● A Compute Unit is a
Streaming Multiprocessor
(SM)
● A Processing Element (PE)
is a CUDA core (on NVIDIA
all cores in a warp execute
the same instruction)
OpenCL / CUDA
Data Parallel Model
OpenCL :
● NDRange
● WorkGroup
● WorkItem
CUDA :
● Grid
● ThreadBlock
● Thread
The problem is represented as a computation carried out over a 1-, 2- or 3-dimensional array.
OpenCL
NDRange, work-group, work-item
From Intel https://software.intel.com/sites/landingpage/opencl/optimization-guide/Basic_Concepts.htm
[Figure: an NDRange divided into work-groups, each made of work-items; the CUDA equivalents are grid, threadblock and thread]
OpenCL attributes for FPGA
#define NUM_SIMD_WORK_ITEMS  4  
#define REQD_WORK_GROUP_SIZE (64,1,1) 
#define NUM_COMPUTE_UNITS  2 
#define MAX_WORK_GROUP_SIZE 512  
__kernel 
__attribute__((max_work_group_size( MAX_WORK_GROUP_SIZE )))
__attribute__((reqd_work_group_size REQD_WORK_GROUP_SIZE ))
__attribute__((num_compute_units( NUM_COMPUTE_UNITS )))
__attribute__((num_simd_work_items( NUM_SIMD_WORK_ITEMS )))
void function(..) { ...; }
             
But ..
The compiler is mostly resource driven and often it does not obey your will, despite what the docs promise.
Vector Addition/Matrix Multiplication
OpenCL kernels
// vector addition 
C:
  for(i=0;i<N;i++){
         C[i] = A[i]+B[i];
  }        
OpenCL:
__kernel void vecadd(__global const float* A,
                __global const float* B,
                __global float* C)
{
      int i = get_global_id(0);   // one work-item per element
      C[i] = A[i] + B[i];
}
// matrix multiplication
C:
  for(i=0;i<N;i++){
    for(j=0;j<N;j++){
      Temp = 0.0f;
      for(k=0;k<N;k++){
        Temp += A[i][k] * B[k][j];
      }
      C[i][j] = Temp;
    }
  }
OpenCL:
__kernel void matmul(__global const float* A,
                     __global const float* B,
                     __global float* C,
                     const int N)
{
      int i = get_global_id(0);   // row of C
      int j = get_global_id(1);   // column of C
      float sum = 0.0f;           // private accumulator of this work-item
      for(int k=0;k<N;k++) {
          // kernel arguments are flat buffers: row-major N x N matrices
          sum += A[i*N + k] * B[k*N + j];
      }
      C[i*N + j] = sum;
}
X. Getting the most
we need to look at the architecture !
Arria 10 DSP in Floating Point mode
Arria 10 killah kernel
Initializing OpenCL
Platform: Altera SDK for OpenCL
Using 1 device(s)
p385a_sch_ax115 : nalla_pcie (aclnalla_pcie0)
Using AOCX: loop.aocx
Reprogramming device with handle 1
Launching for device 0 (100000 elements)
Total runs 100000 , gflop 107374.182400
100,000 x 4 x (256*1024*1024)
Wall Time: 139909.012 ms
Gflop/s 767.457225
Kernel time (device 0): 139908.517 ms
Gflop/s 767.459945
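[Figure: data flow of one loop iteration: starting from x[i], res is multiplied by 2.0, 2.0 is added, the result is multiplied by 0.5 and -1.0 is added]
#define N (256*1024*1024)
__kernel
void loop
(__global const float* x,
 __global float *restrict  y)
{
  int gid = get_global_id(0);   // index of this work-item
  float res = x[gid];           // private accumulator
  #pragma unroll 700
  for(int i=0;i<N;i++){         // separate loop counter, so gid stays valid
    res = res*2.0f + 2.0f;
    res = res*0.5f - 1.0f;
  }
  y[gid] = res;
}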
Arria 10 killah kernel – Quartus report
+---------------------------------------------------------------+
; Spectra-Q Synthesis Resource Usage Summary for Partition "|"  ;
+---------------------------------------------+-----------------+
; Resource                                    ; Usage           ;
+---------------------------------------------+-----------------+
; Estimate of Logic utilization (ALMs needed) ; 82174           ;
;                                             ;                 ;
; Combinational ALUT usage for logic          ; 102803          ;
;     -- 7 input functions                    ; 5               ;
;     -- 6 input functions                    ; 1842            ;
;     -- 5 input functions                    ; 11104           ;
;     -- 4 input functions                    ; 18594           ;
;     -- <=3 input functions                  ; 71258           ;
;                                             ;                 ;
; Dedicated logic registers                   ; 151334          ;
;                                             ;                 ;
; I/O pins                                    ; 0               ;
; Total MLAB memory bits                      ; 0               ;
; Total block memory bits                     ; 1348604         ;
;                                             ;                 ;
; Total DSP Blocks                            ; 1400            ;
;     -- Total Fixed Point DSP Blocks         ; 0               ;
;     -- Total Floating Point DSP Blocks      ; 1400            ;
;                                             ;                 ;
; Maximum fan-out node                        ; clock_reset_clk ;
; Maximum fan-out                             ; 155035          ;
; Total fan-out                               ; 692846          ;
; Average fan-out                             ; 2.65            ;
+---------------------------------------------+-----------------+
This is extremely good. It shows that the OpenCL compiler really created the same design an experienced hardware engineer could have created using Verilog.
It used one DSP in multiply-add mode for each line of the loop body, i.e. 2 DSPs per unrolled iteration (700 × 2 = 1400).
And it reached a performance of about 50 % of the peak (767 of ~1500 Gflop/s).
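The run figures can be cross-checked with a little arithmetic: the total flop count, the fraction of the back-of-the-envelope peak, and the clock frequency the design would need if every one of the 1400 DSPs completed one multiply-add (2 flops) per cycle. That last premise is my assumption, as are the function names; the sketch says nothing about the frequency Quartus actually reported.

```python
def total_gflop(runs, flops_per_iteration, iterations):
    """Total work in Gflop."""
    return runs * flops_per_iteration * iterations / 1e9


def implied_clock_ghz(gflops, dsp_blocks, flops_per_dsp_per_cycle=2):
    """Clock needed to sustain `gflops` if every DSP delivers its flops each cycle."""
    return gflops / (dsp_blocks * flops_per_dsp_per_cycle)


if __name__ == "__main__":
    work = total_gflop(100_000, 4, 256 * 1024 * 1024)   # 4 flops per loop iteration
    rate = work / 139.908517                             # kernel time in seconds
    print(f"work  = {work:.4f} Gflop")
    print(f"rate  = {rate:.2f} Gflop/s ({rate / (2 * 1518 * 0.5):.0%} of peak)")
    print(f"clock = {implied_clock_ghz(rate, 1400) * 1000:.0f} MHz implied")
```

Under that assumption the measured rate corresponds to roughly 274 MHz, well below the 0.5 GHz used for the peak estimate, which would account for most of the gap.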
XI. Programming FPGAs
with Schematics
Only very small projects can be handled using schematics
Quartus – FPGA using Schematics
1. Create a new project with the wizard (give a directory and a project name) and select an empty project
2. Choose the FPGA model: 10AX115N3F40E2SG
3. Open New File and choose Design->Schematic : a design whiteboard opens up
4. Choose all the components you need : i/o pins, dsp blocks (choose them in integer or fp mode from the IP catalog; a parameter editor will open up and you can program them to be adders/multipliers/fma)
5. Connect the components with busses from the top menu
Quartus – Schematic for scalar product
Quartus report using Schematics
XII. Programming FPGAs with an HDL :
Verilog
Again the Scalar Product of 2 vectors of length 4
Large projects can't be managed using schematics : hundreds/thousands/tens of thousands of components, millions of interconnections, ...
top.v
module top( x0,y0,x1,y1,x2,y2,x3,y3,z,clk,ena,aclr);
input [31:0]x0; input [31:0]y0;
input [31:0]x1; input [31:0]y1;
input [31:0]x2; input [31:0]y2;
input [31:0]x3; input [31:0]y3;      
output [31:0]z;
input clk; input ena; input [1:0]aclr;
wire [31:0]ir0; wire [31:0]ir1; wire [31:0]ir2; wire [31:0]ir3; 
wire [31:0]ir4; wire [31:0]ir5;
    dsp_fp_mul m1(.aclr(aclr),.ay(x0),.az(y0),.clk(clk),.ena(ena),.result(ir0));
    dsp_fp_mul m2(.aclr(aclr),.ay(x1),.az(y1),.clk(clk),.ena(ena),.result(ir1));
    dsp_fp_mul m3(.aclr(aclr),.ay(x2),.az(y2),.clk(clk),.ena(ena),.result(ir2));
    dsp_fp_mul m4(.aclr(aclr),.ay(x3),.az(y3),.clk(clk),.ena(ena),.result(ir3));
        
    dsp_fp_add a1(.aclr(aclr),.ax(ir0),.ay(ir1),.clk(clk),.ena(ena),.result(ir4));
    dsp_fp_add a2(.aclr(aclr),.ax(ir2),.ay(ir3),.clk(clk),.ena(ena),.result(ir5));
    dsp_fp_add a3(.aclr(aclr),.ax(ir4),.ay(ir5),.clk(clk),.ena(ena),.result(z));
endmodule
[Figure: module hierarchy: top.v instantiates dsp_fp_mul.v four times (m1, m2, m3, m4) and dsp_fp_add.v three times (a1, a2, a3)]
In Verilog, what looks like a function call is in fact the instantiation of one circuit inside another. The parameter syntax expresses the connection of wires to wires.
dsp_fp_xxx
// dsp_fp_mul.v
// Generated using ACDS version 16.0 211
`timescale 1 ps / 1 ps
module dsp_fp_mul (
input  wire [1:0]  aclr,   //   aclr.aclr
input  wire [31:0] ay,     //     ay.ay
input  wire [31:0] az,     //     az.az
input  wire        clk,    //    clk.clk
input  wire        ena,    //    ena.ena
output wire [31:0] result  // result.result
);
dsp_fp_mul_altera_fpdsp_block_160_ebvuera fpdsp_block_0 (
.clk    (clk),    //    clk.clk
.ena    (ena),    //    ena.ena
.aclr   (aclr),   //   aclr.aclr
.result (result), // result.result
.ay     (ay),     //     ay.ay
.az     (az)      //     az.az
);
endmodule
// dsp_fp_add.v
`timescale 1 ps / 1 ps
// port names match the .ax/.ay/.result connections used in top.v
module dsp_fp_add (ax,ay,result,clk,ena,aclr);
input wire [31:0]ax;
input wire [31:0]ay;
output wire [31:0]result;
input wire clk;
input wire ena;
input wire [1:0]aclr;
dsp_fp_add_altera_fpdsp_bloc_160_nmfrqti fdsp_block_0 (
.clk (clk),
.ena(ena),
.aclr(aclr),
.ax     (ax),     //     ax.ax
.ay     (ay),     //     ay.ay
.result (result)  // result.result
);
endmodule
These 2 modules (dsp_fp_mul.v and dsp_fp_add.v) are generated automatically when you instantiate a DSP in floating point mode from the IP cores and configure it as an adder or a multiplier
Quartus report on
Scalar Product using HDL
Exactly the same as
for the project
Using Schematics
SystemVerilog (1) – Killah kernel
sp_12.sv
module sp_12
#( parameter N=700)
(
input logic [31:0]x,
output logic [31:0]out,
input logic clk,ena,
input logic [1:0]aclr
);
logic [31:0] mul_2,add_2,
mul_05,sub_1;
logic [31:0]ir[2*N+4];
// IEEE 754 single precision bit patterns of the constants
// (a shortreal cast assigned to a logic vector would be converted to an integer)
assign mul_2 = 32'h40000000;   //  2.0
assign add_2 = 32'h40000000;   //  2.0
assign mul_05 = 32'h3F000000;  //  0.5
assign sub_1 = 32'hBF800000;   // -1.0
assign ir[0] = x;
genvar i;
generate
for(i=0;i<=N;i=i+1)
begin: FMA2_LOOP
// first line of the loop body: multiply by 2.0, add 2.0
dsp_fp_fma inst (
.ax(add_2),
.ay(ir[2*i]),
.az(mul_2),
.result(ir[2*i+1]),
.clk(clk),
.ena(ena),
.aclr(aclr)
);
// second line of the loop body: multiply by 0.5, add -1.0
dsp_fp_fma inst1(
.ax(sub_1),
.ay(ir[2*i+1]),
.az(mul_05),
.result(ir[2*i+2]),
.clk(clk),
.ena(ena),
.aclr(aclr)
);
end
endgenerate
assign out = ir[2*N+2];
endmodule
Quartus report :
1,402 DSP used
(1) SystemVerilog is a new edition of Verilog (IEEE 1800-2012) with many additions
// dsp_fp_fma.v
// Generated using ACDS version 16.0 211
`timescale 1 ps / 1 ps
module dsp_fp_fma (
input  wire [1:0]  aclr,   //   aclr.aclr
input  wire [31:0] ax,     //     ax.ax
input  wire [31:0] ay,     //     ay.ay
input  wire [31:0] az,     //     az.az
input  wire        clk,    //    clk.clk
input  wire        ena,    //    ena.ena
output wire [31:0] result  // result.result
);
dsp_fp_fma_altera_fpdsp_block_160_fj4u2my fpdsp_block_0 (
.clk    (clk),    //    clk.clk
.ena    (ena),    //    ena.ena
.aclr   (aclr),   //   aclr.aclr
.result (result), // result.result
.ax     (ax),     //     ax.ax
.ay     (ay),     //     ay.ay
.az     (az)      //     az.az
);
endmodule
Verilog : dsp_fp_fma.v
This file is generated automatically when you instantiate a DSP as a multiplier/adder with the parameter editor. It differs from the files generated for a single-operation instantiation (only mul or only add) : as you can see, it uses all 3 input busses.
XIII. Spatial Computing (OpenSPL)
OpenSPL
Open Spatial Programming Language
● A buzzword in the hands of a consortium led by Maxeler and Juniper on the industrial side and by Stanford University, Imperial College, Tokyo University, ... on the academic side
● Everything is kept as a trade secret for now
● Java interface ..
● IMHO this is a missed opportunity :
– "Spatial Programming" is probably the wrong term at a time when thousands of things around GPS, GEO, etc. already go by that name
– Plans and standards should be open, not kept secret from everyone except consortium members
– The industrial members are weak in this market
– Java is, IMHO, not the right tool here
– An open source movement should be started instead
My Proposal: json-graph-fpga
Use a simple and already existing format to describe the graph of components: JSON for instance, or JSON Graph. (We assume all components are connected to a global clock.)
{
"inputs":["x0","x1","x2","x3",
"y0","y1","y2","y3"],
"x0":["m1"],"y0":["m1"],
"x1":["m2"],"y1":["m2"],
"x2":["m3"],"y2":["m3"],
"x3":["m4"],"y3":["m4"],
"m1":["a1"],"m2":["a1"],
"m3":["a2"],"m4":["a2"],
"a1":["a3"],"a2":["a3"],
"a3":["outputs"]
}
[Figure: the corresponding graph: Inputs → multipliers m1, m2, m3, m4 → adders a1, a2 → adder a3 → Outputs]
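To show that such a description carries enough information to be processed by a tool, here is a small sketch that loads the scalar-product graph (each x[i] paired with the y[i] of the same index) and evaluates it in software. The convention that nodes named m* multiply and nodes named a* add is my own assumption for the example: the proposed format does not yet say how operations are attached to nodes.

```python
import json

GRAPH_JSON = """
{
  "inputs": ["x0", "x1", "x2", "x3", "y0", "y1", "y2", "y3"],
  "x0": ["m1"], "y0": ["m1"],
  "x1": ["m2"], "y1": ["m2"],
  "x2": ["m3"], "y2": ["m3"],
  "x3": ["m4"], "y3": ["m4"],
  "m1": ["a1"], "m2": ["a1"],
  "m3": ["a2"], "m4": ["a2"],
  "a1": ["a3"], "a2": ["a3"],
  "a3": ["outputs"]
}
"""


def evaluate(graph, input_values):
    """Evaluate the component graph; every component has exactly two predecessors."""
    predecessors = {}
    for source, destinations in graph.items():
        if source == "inputs":
            continue
        for destination in destinations:
            predecessors.setdefault(destination, []).append(source)

    values = dict(input_values)

    def value(node):
        if node not in values:
            left, right = (value(p) for p in predecessors[node])
            # assumed naming convention: m* = multiplier, a* = adder
            values[node] = left * right if node.startswith("m") else left + right
        return values[node]

    (last,) = predecessors["outputs"]
    return value(last)


if __name__ == "__main__":
    graph = json.loads(GRAPH_JSON)
    x, y = [1, 2, 3, 4], [5, 6, 7, 8]
    inputs = {f"x{i}": x[i] for i in range(4)} | {f"y{i}": y[i] for i in range(4)}
    print(evaluate(graph, inputs))
```

A hardware back end would walk the same predecessor table, emitting one DSP instance per node and one bus per edge instead of computing values.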
XIV. What's next ?
Top INTEL/Altera Product Stratix 10
● Arria 10 (10AX115)
– 20nm technology
– Log El 1,150,000
– ALM 427,200
– DSP 1,518
– M20K blocks 2,713
– Reg 1,708,800
– Peak Tflops 1.5
● Stratix 10 (GX2800)
– Intel 14nm (TriGate) FinFET
– Log El 2,753,000
– ALM 933,120
– DSP 5,760
– M20K blocks 11,721
– Reg 3,732,480
– Peak Tflops 10
Stratix 10 ≈ 6 x (fp perf of Arria 10)
How to lift off-board b/w limitations ?
Directly to QPI or PCIe
● Connect directly to the Intel QPI (Quick Path Interconnect) or the future Intel UPI (Ultra Path Interconnect), the processor/chipset point-to-point interconnect (60-80 GB/s). This has already been done with Xilinx chips
● Stratix 10 supports 4x PCIe Gen3 x16, ~ 60 GB/s
Stand alone
● Use FPGAs stand alone. The Stratix 10 supports DDR4 memory or HMC (Hybrid Memory Cube). Connections with Interlaken channels support 14.7 Gb/s per lane.
XV. Competitors
Competitors
● NVIDIA P100 (next year)
– 3,584 cores
– 1,328 Mhz 300 W, 1,126 Mhz 250 W
– CUDA 6.0
– Single Precision Gflops 8,000-10,000
– 3584*1328*2 = 9,519 TFlops
– TDP 250-300 Watt
– ~ 10,000-12,000 USD
● INTEL Xeon Phi 7290
– 72 cores
– Freq. 1.50 Ghz
– TDP 245 Watt
– ~ 4,110 USD
● INTEL Xeon E5-4699v4
– 22 cores
– Freq. 2.20 Ghz
– TDP 135 Watt
– ~ 7,000 USD
● INTEL Arria 10 GX
– 1,518 DSP
– Freq. 0.5 GHz
– Peak 1518*2*0.5 = 1.5 TFlops
– TDP ~ 30 Watt
– ~ 5,000 USD
● INTEL Stratix 10
– 5,760 DSP
– Freq. 1.0 GHz
– Peak 10 TFlops
– TDP ~ 30-40 Watt
– ~ 20,000 USD (uncertain)
[Slide labels grouping the devices: INTEL/Altera FPGAs, NVIDIA GPGPU, INTEL Xeon Phi, INTEL Xeon]
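The peak figures above follow the pattern units x flops per cycle x clock frequency. As a small sketch, the Python below recomputes them and derives GFlops per Watt for the devices whose peak is given on this slide. The helper `peak_gflops` is a hypothetical name; the P100 entry uses the 300 W operating point, and the Stratix 10 entry uses the quoted 10 TFlops with the upper 40 W TDP estimate, both of which are choices made here for illustration.

```python
def peak_gflops(units, flops_per_cycle, freq_ghz):
    """Peak GFlop/s = execution units x flops per cycle per unit x clock (GHz)."""
    return units * flops_per_cycle * freq_ghz


devices = {
    # name: (peak GFlop/s, TDP in Watt)
    "Arria 10 GX": (peak_gflops(1518, 2, 0.5), 30),
    "NVIDIA P100": (peak_gflops(3584, 2, 1.328), 300),
    "Stratix 10": (10_000, 40),  # peak as quoted, upper TDP estimate
}

for name, (gflops, tdp) in devices.items():
    print(f"{name:12s} {gflops:8.0f} GFlop/s  {gflops / tdp:6.1f} GFlops/Watt")
```

The Xeon and Xeon Phi entries are left out because the slide lists no peak figure for them.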
Competitors/2
[Bar chart "TDP / Peak GFlop/s / Price": TDP (Watt x 100), peak FP GFlop/s and price for Arria 10, Stratix 10, INTEL E5-2699v4, INTEL Phi 7290 and NVIDIA P100; vertical axis from 0 to 35,000]
Competitors/3
[Bar chart "GFlops / Watt" for Arria 10, Stratix 10, INTEL E5-2699v4, INTEL Phi 7290 and NVIDIA P100; vertical axis from 0 to 300]
XVI. Can I use it ?
Can I use it ?
– I'm interested in running comparison tests against Tesla and other architectures
– I'm interested in trying kernels with sufficient Arithmetic Intensity to run efficiently
– I'm interested in interesting problems :)
– The limitation is that there is only 1 board in 1 PC, and the compiler license is for 1 seat.
Please write to me about this !
“and go on till you come to the end: then stop.”
Lewis Carroll
though I think Jacques de La Palice (or de La Palisse) could also have said something like that
END

VVIP Pune Call Girls Karve Nagar (7001035870) Pune Escorts Nearby with Comple...
 
(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
 
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
 
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
 
9004554577, Get Adorable Call Girls service. Book call girls & escort service...
9004554577, Get Adorable Call Girls service. Book call girls & escort service...9004554577, Get Adorable Call Girls service. Book call girls & escort service...
9004554577, Get Adorable Call Girls service. Book call girls & escort service...
 
Lucknow 💋 Call Girls Adil Nagar | ₹,9500 Pay Cash 8923113531 Free Home Delive...
Lucknow 💋 Call Girls Adil Nagar | ₹,9500 Pay Cash 8923113531 Free Home Delive...Lucknow 💋 Call Girls Adil Nagar | ₹,9500 Pay Cash 8923113531 Free Home Delive...
Lucknow 💋 Call Girls Adil Nagar | ₹,9500 Pay Cash 8923113531 Free Home Delive...
 
Lubrication and it's types and properties of the libricabt
Lubrication and it's types and properties of the libricabtLubrication and it's types and properties of the libricabt
Lubrication and it's types and properties of the libricabt
 

Fpga computing

  • 1. FPGA computing @ SISSA. Roberto Innocente, inno@sissa.it, 2016-10-19
  • 2. “Begin at the beginning ..” Lewis Carroll, Alice in Wonderland
  • 3. Table of Contents: 1. Project history 2. What is an FPGA? 3. INTEL/Altera Arria 10 4. 7 Dwarfs 5. Arithmetic Intensity (AI) 6. Roofline Model 7. CUDA/OpenCL 8. Actual Performance 9. OpenCL for FPGA 10. Getting the most 11. Schematics 12. HDL for FPGA 13. Spatial Computing (SC) 14. What next? 15. Competitors 16. Can I use it?
  • 4. FPGA Computing project
      ● Project proposed at the beginning of 2014: http://people.sissa.it/~inno/pubs/reconfig-computing-16-9-tris.pdf
      ● Nallatech board with Arria 10 FPGA ordered at the end of April (2016-04-22)
      ● Nallatech board arrived on 2016-06-24
      ● Trouble with software licenses solved in mid-August (2016-08-14)
  • 5. II. What is an FPGA?
  • 6. What is an FPGA?
      ● FPGA (Field Programmable Gate Array) is a misnomer: gates in digital electronics are very simple circuits such as and, or, not, xor.
      ● It is in fact an array of Configurable Logic Blocks (CLBs): each has 6/7/8 inputs, and its output can be any boolean function over them or over 2/3 subsets of them.
      ● It is a “blank slate” on which you have to program both the functions that the logic blocks perform and the interconnections between them.
      ● Today some of the logic blocks are specialized to be more efficient: memory blocks, DSP blocks, I/O blocks, ...
      Note: DSP = Digital Signal Processor (multiplier/adder).
  • 7. Array of Configurable Logic Blocks (picture from National Instruments)
  • 8. Scalar Product on an FPGA
      x . y = Sum x[i]*y[i], for i = 0..3
      [Diagram, DFG = Data Flow Graph: four multipliers x[i]*y[i] feed two adders, which feed a final adder.]
      While with other architectures you need to adapt your program to the architecture, with an FPGA you adapt the architecture to your program. Each cycle a new result comes out, and each result is worth 7 flops (4 multiplications and 3 additions).
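As a rough software analogue of this data flow graph, the following C sketch computes the same length-4 scalar product as four independent multiplications followed by a two-level adder tree. The function name `dot4_tree` is made up for illustration; on a CPU the operations run one after another, whereas on the FPGA each node of the graph is its own piece of hardware.

```c
/* Scalar product of two length-4 vectors, shaped like the data flow graph:
 * four independent multiplications feed a two-level adder tree,
 * for a total of 4 + 3 = 7 floating point operations per result.
 * dot4_tree is a hypothetical name used only for this sketch. */
float dot4_tree(const float x[4], const float y[4])
{
    /* Level 0: the four multipliers (no dependencies between them). */
    float m0 = x[0] * y[0];
    float m1 = x[1] * y[1];
    float m2 = x[2] * y[2];
    float m3 = x[3] * y[3];

    /* Level 1: two adders working on pairs of products. */
    float a0 = m0 + m1;
    float a1 = m2 + m3;

    /* Level 2: the final adder. */
    return a0 + a1;
}
```

Writing the sum as a tree rather than as a running accumulator keeps the dependency chain at three levels, which is the shape a pipelined hardware implementation would have.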
  • 9. III. The FPGA of our tests
  • 10. INTEL/Altera Arria 10
      ● This was the first FPGA on the market to offer native floating point multiply/add in its DSP blocks.
      ● That is the reason why we bought it.
      ● Of course, on the other large FPGAs you can also implement floating point operations, using the IP cores offered by vendors.
      NB. IP core: a function implemented in schematics or in an HDL that is not free but proprietary (IP = Intellectual Property).
  • 11. INTEL (Altera) Arria 10
      ● INTEL Arria 10 GX1150:
        – Logic Elements: 1,150 K
        – ALMs: 427,200
        – Registers: 1,708,800
        – M20K memory blocks: 2,713
        – DSP blocks: 1,518 (integer and single precision float)
      ● Back of the envelope calculation: each DSP can output one single precision fused multiply-add (2 flops) per cycle
        – 2 × 1518 = 3036 flops/cycle × 0.5 GHz ≈ 1500 Gflop/s
      INTEL bought Altera in 2015 and is now re-branding everything. To avoid short-term obsolescence I will call it INTEL FPGA.
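The back of the envelope estimate is just a product of three numbers, and it can be written down as a small C helper. The function name and parameter names are hypothetical; the inputs (1518 DSP blocks, 2 flops per fused multiply-add, 0.5 GHz) are the ones quoted on the slide.

```c
/* Theoretical peak in Gflop/s: every DSP block is assumed to deliver
 * flops_per_dsp_per_cycle operations on every clock cycle.
 * peak_gflops is a hypothetical helper, not part of any vendor API. */
double peak_gflops(int dsp_blocks, int flops_per_dsp_per_cycle, double clock_ghz)
{
    return (double)dsp_blocks * flops_per_dsp_per_cycle * clock_ghz;
}
```

With the slide's numbers, `peak_gflops(1518, 2, 0.5)` gives 1518, which the slide rounds to 1500 Gflop/s.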
  • 12. [Figure from INTEL/Altera docs]
  • 13. IV. How to measure performance of new architectures?
  • 14. The “Seven Dwarfs”
      At the dawn of the new many-core and heterogeneous computer architectures, Phil Colella of LBL wrote the presentation Defining Software Requirements for Scientific Computing, in which he claimed that all new architectures should be measured against seven computational kernels common across every branch of scientific computing. These computational kernels were later nicknamed the Seven Dwarfs because, as in the Snow White fairy tale, they should be mining for gold in new computer architectures.
      “A dwarf is an algorithmic method that captures a pattern of computation and communication.” http://view.eecs.berkeley.edu/wiki/Dwarf_Mine
      The dwarfs grew with time to 13.
      High-end simulation in the physical sciences consists of seven algorithms (Phil Colella, 2004, LBL):
        • Structured Grids (stencils, including locally structured grids, e.g. AMR)
        • Unstructured Grids
        • Fast Fourier Transform
        • Dense Linear Algebra
        • Sparse Linear Algebra
        • Particles
        • Monte Carlo
  • 15. V. Arithmetic Intensity (AI)
  • 16. Arithmetic/Computational Intensity (AI)
      AI = flops / bytes transferred from/to off-chip memory
      Kernel                                                                  AI single precision           AI double precision
      Vector addition: z[i] = x[i] + y[i]                                     1/12 = 0.083                  1/24
      Scalar product: Σ a[i] * b[i]                                           1/4 = 0.250                   1/8
      Vector magnitude: Σ a[i] * a[i]                                         1/2 = 0.500                   1/4
      SAXPY: α∗x + y                                                          1/6 = 0.167                   1/12
      Stencil, 4 neighbors: C[i,j] = a*A[i,j] + b*(A[i-1,j]+A[i+1,j]...)      5/24 = 0.208                  5/48
      Matrix multiply: C[i,j] = Σk A[k,i] * B[k,j]                            1/4 = 0.250                   1/8
      FFT 1d                                                                  0.9*log(N) = 7.48 (N=4096)
  • 17. AI Arithmetic Intensity [Figure from https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/]
  • 18. VI. The Roofline Model (RM)
  • 19. The Roofline Model
      Sam Williams, A. Waterman, D. Patterson (2009-04-01). "Roofline: An Insightful Visual Performance Model for Multicore Architectures" http://doi.acm.org/10.1145/1498765.1498785
      ● It is an intuitive visual model that provides performance estimates
      ● It is based on two ceilings:
        – Peak flop performance of the architecture
        – Maximum throughput of off-chip memory
  • 20. Arria 10 Roofline Model. Break-even point at AI = 250.
      [Plot: attainable Gflop/s versus arithmetic intensity (flops/byte transferred), both axes logarithmic from 0.01 to 1000. Limits are I/O bandwidth 6 GB/s and peak 1.5 Tflop/s. Kernels marked on the plot: vector add (1/12), stencil with 4 neighbors (5/24), scalar product (1/4), SAXPY (3/8), vector magnitude (1/2). Theoretical peak curve: (x<250) ? 6*x : 1500]
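The roofline curve on this slide is the minimum of two ceilings, and a small C sketch makes that explicit. The function names are hypothetical; the bandwidth (6 GB/s) and peak (1500 Gflop/s) are the values quoted on the slide, passed in as parameters rather than hard-coded.

```c
/* Attainable performance in Gflop/s for a kernel with arithmetic intensity
 * `ai` (flops per byte), limited either by memory traffic (bandwidth * ai)
 * or by the compute peak. Names are hypothetical, chosen for this sketch. */
double roofline_gflops(double ai, double bandwidth_gbs, double peak_gflops)
{
    double memory_bound = bandwidth_gbs * ai;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

/* Arithmetic intensity at which the two ceilings meet. */
double break_even_ai(double bandwidth_gbs, double peak_gflops)
{
    return peak_gflops / bandwidth_gbs;
}
```

For example, `roofline_gflops(1.0 / 12.0, 6.0, 1500.0)` is about 0.5 Gflop/s (the vector addition bound), and `break_even_ai(6.0, 1500.0)` is 250, the break-even point named on the slide.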
  • 21. VII. CUDA/OpenCL
  • 22. Data parallelism / Task parallelism [Figure from www.fixtars.com with two panels, Data Parallel and Task Parallel: tasks working on the index ranges 0 .. ¼N, ¼N .. ½N, ½N .. ¾N, ¾N .. N, each panel ending in a SUM]
  • 23. The rise of the CUDA/OpenCL model
      ● In the middle of the past decade it became clear that Moore's law could be kept up only through parallelism. Many-core and heterogeneous computers appeared: GPUs, FPGAs, CPUs, DSPs.
      ● GPUs came with hundreds and then thousands of simple cores (forthcoming NVIDIA Pascal: ~3,800, available from 2017).
      ● Data parallelism, unlike task parallelism, can be supported with a simple model: a compute pattern (kernel) instantiated on every core with a different set of indices.
        – a[i]+b[i] (vector addition kernel)
        – Σk a[i,k] * b[k,j] (matrix multiplication kernel)
        – 1/2/3-dimensional NDRange / grid
      ● Each instantiation (work-item/thread) obtains its own parameters through a function call (e.g. get_global_id(); in fact the core computes the displacements by itself, knowing its work-group and work-item numbers).
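To make the "one kernel instance per index" idea concrete, here is a plain C sketch that emulates a 1-dimensional NDRange sequentially. It is not OpenCL: the explicit `gid` argument stands in for what `get_global_id(0)` would return, and both function names are invented for this illustration.

```c
#include <stddef.h>

/* Kernel body: one work-item handles exactly one index.
 * gid plays the role of get_global_id(0) in a real OpenCL kernel. */
static void vecadd_item(size_t gid, const float *a, const float *b, float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Sequential stand-in for launching a 1-D NDRange of global_size work-items.
 * A real device would run many of these instances at the same time. */
void launch_vecadd_1d(size_t global_size, const float *a, const float *b, float *c)
{
    for (size_t gid = 0; gid < global_size; ++gid)
        vecadd_item(gid, a, b, c);
}
```

Because no work-item reads what another one writes, the loop iterations can be executed in any order or all at once, which is what makes the model simple to map onto many cores or onto a hardware pipeline.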
  • 24. NVIDIA Pascal / GP100 ● GP100 (device) ● SM (compute unit): 2 x vector processors, 32-wide SIMD (because there is only 1 PC per warp)
  • 25. OpenCL for FPGAs
      ● There is a compiler front end (UIUC LLVM) for the HDL Place & Route (PAR) package (Quartus Pro in the INTEL/Altera case).
      ● For FPGAs the compilers are all offline compilers. Why?
        – It takes many hours or days of CPU time to synthesize a complete project.
        – Forget about the Apple/NVIDIA examples in which the OpenCL code is a string inside the host C++ program.
        – INTEL/Altera say you need 32 GB of main memory, but in fact I have seen the compilation processes use 40/50 GB many times (so 64 GB is a better size).
      ● aoc, the INTEL/Altera Offline Compiler:
        – aoc krnl.cl -o krnl.aocx
  • 26. [Diagram of the two compilation paths. Host code path: host source code (.c or .cpp) → host compiler → host binary. Kernel code path: kernel source code (.cl) → AOC (INTEL/Altera Offline Compiler) → FPGA binary (.aocx). The host application is then executed on the host.]
  • 27. FPGA/OpenCL
      ● OpenCL was born for different computer architectures and doesn't capture all the possibilities FPGAs can offer.
      ● Anyway, OpenCL for FPGA seems a mature product that offers a big step up in easily obtainable FPGA performance.
  • 28. VIII. Actual performance
  • 29. Results Reported
      ● All the results reported here were obtained using the INTEL/Altera OpenCL compiler 16.0.0 Build 211 and the same version of Quartus Pro.
      ● In a future report I will discuss Verilog results.
  • 30. Vector Addition
      ● z[i] = x[i] + y[i]
      ● Computational intensity is very low: AI = 1/12
      ● The limit then comes from I/O: 6 GB/s × 1/12 = 0.5 Gflop/s
      ./vector_add
      Initializing OpenCL
      Platform: Altera SDK for OpenCL
      Performance on CPU 1 core of intel i7 :
         Processing time on CPU   = 1.1313ms
         Mflops/s 883.948201
      Launching for device 0 (1000000 elements)
      Performance on FPGA :
         Processing time on FPGA  = 6.5348ms
         Mflop/s on FPGA= 153.027972
         Time: 6.535 ms
         Kernel time (device 0): 3.668 ms
  • 31. Stencil code
      PDE → Difference Equation (substitute derivatives with discrete approximations, using a symbolic algebra package) → Stencil code (code that updates a point using the neighbor point values)
      [Figure: 3D stencil of order 8]
  • 32. Stencil code/2
      ● 504^3 = 0.128 G points in the lattice
      ● 5 time steps: 0.128 × 5 = 0.640 G points processed
      ● 321 ms: 0.640 / 0.321 = 1.993 Gpoints/s processed
      ● 24 neighbors + 1 = 25 points × 2 ops = 50 ops per point
      ● 1.993 Gpoints/s × 50 flops = 99 Gflop/s on FPGA
      ● On a single core of an Intel i7 CPU: 0.85 Gflop/s
      ● Arithmetic intensity: $CI = \frac{2 \times 25 \times N^3}{4 \times 25 \times N^3} = \frac{1}{2}$ … but ..
      $ ./stencil
      Volume size: 504 x 504 x 504
      order-8 stencil computation for 5 time steps
      Performance on FPGA :
            Processing time : 321 ms
            Throughput = 1.9897 Gpoints / sec
            Gflops per second  99.486999
      Performance on CPU intel i7 1 core :
            Processing time on cpu = 37524.9531ms
            Throughput on cpu = 0.0171 Gpoints / sec
            Gflops per second  on cpu 0.852926
            Verifying data --> PASSED
  • 33. Matrix Multiplication
      ● Matrix sizes:
        – A: 2048 x 1024
        – B: 1024 x 1024
        – C: 2048 x 1024
      ● FPGA: 128.77 Gflop/s
      ● CPU: 1.48 Gflop/s (on 1 core of Intel i7)
      ● $AI = \frac{N^2 \times (2 \times N - 1)}{4 \times 3 \times N^2} = \frac{2N - 1}{12} \approx \frac{N}{6}$ … but ..
      Generating input matrices
      Launching for device 0 (global size:  1024, 2048)
      Performance of FPGA :
         Time: 33.353 ms
         Kernel time (device 0): 33.294 ms
         Throughput: 128.77 GFLOPS
      Computing reference output
      Performance of CPU Intel i7 single core :
         Time: 2907.730 ms
         Throughput: 1.48 GFLOPS
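The throughput figures in this log can be cross-checked with a few lines of C, assuming the usual convention of counting 2 flops (one multiply, one add) per inner-product term. That convention and the helper name `matmul_gflops` are assumptions of this sketch; the matrix sizes and times are the ones printed on the slide.

```c
/* Gflop/s of a dense matrix multiplication C = A * B, where C is
 * rows_c x cols_c and the inner (summed) dimension is `inner`.
 * Assumes 2 flops per inner-product term; matmul_gflops is a
 * hypothetical helper written for this cross-check. */
double matmul_gflops(double rows_c, double cols_c, double inner, double time_ms)
{
    double flops = 2.0 * rows_c * cols_c * inner;
    return flops / (time_ms * 1.0e-3) / 1.0e9;
}
```

With the slide's sizes, `matmul_gflops(2048, 1024, 1024, 33.353)` comes out near 128.77 and `matmul_gflops(2048, 1024, 1024, 2907.730)` near 1.48, matching the FPGA and CPU lines of the log.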
  • 34. More efficient Matrix Multiplication
      [Figure: matrices A and B and a tile of C]
      Find 2 slices, of A rows and of B columns, that you can keep in fast memory; then you can compute the corresponding tile of C without accessing any other data (data re-use due to caching). This can greatly increase the arithmetic intensity.
      If you can store stripes of width k, then you read B only once and A N/k times:
      $AI = \frac{2 N k^2 \cdot \frac{N^2}{k^2}}{N^2 + \frac{N}{k} \times N^2} = \frac{2 N^3}{N^2 \left(1 + \frac{N}{k}\right)} = \frac{2}{\frac{1}{k} + \frac{1}{N}} \approx 2k$ (for N much larger than k)
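A minimal C sketch of the tiling idea follows. It is a generic blocked matrix multiplication for square row-major matrices, not the OpenCL kernel used for the measurements; the function name, the square-matrix restriction and the `tile` parameter (playing the role of k) are assumptions made for illustration.

```c
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C = A * B for square n x n row-major matrices, computed tile by tile.
 * `tile` is the tile edge and must be greater than 0; n does not need
 * to be a multiple of it. Hypothetical sketch, not the measured kernel. */
void matmul_tiled(size_t n, size_t tile, const float *A, const float *B, float *C)
{
    for (size_t idx = 0; idx < n * n; ++idx)
        C[idx] = 0.0f;

    for (size_t i0 = 0; i0 < n; i0 += tile)
        for (size_t j0 = 0; j0 < n; j0 += tile)
            for (size_t k0 = 0; k0 < n; k0 += tile) {
                size_t i1 = min_size(i0 + tile, n);
                size_t j1 = min_size(j0 + tile, n);
                size_t k1 = min_size(k0 + tile, n);
                /* Everything touched below fits in three tile x tile blocks,
                 * so it can stay in fast memory while it is being reused. */
                for (size_t i = i0; i < i1; ++i)
                    for (size_t k = k0; k < k1; ++k) {
                        float a = A[i * n + k];
                        for (size_t j = j0; j < j1; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

The result is the same as with the plain triple loop; only the order of the memory accesses changes, which is what raises the arithmetic intensity.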
  • 35. Matrix Multiplication/2
      +--------------------------------------------------------------------+
      ; Estimated Resource Usage Summary                                   ;
      +----------------------------------------+---------------------------+
      ; Resource                               + Usage                     ;
      +----------------------------------------+---------------------------+
      ; Logic utilization                      ;   42%                     ;
      ; ALUTs                                  ;   17%                     ;
      ; Dedicated logic registers              ;   25%                     ;
      ; Memory blocks                          ;   40%                     ;
      ; DSP blocks                             ;   31%                     ;
      +----------------------------------------+---------------------------;
  • 36. Matrix Multiplication/3
      +---------------------------------------------+-----------------+
      ; Resource                                    ; Usage           ;
      +---------------------------------------------+-----------------+
      ; Estimate of Logic utilization (ALMs needed) ; 64255           ;
      ;                                             ;                 ;
      ; Combinational ALUT usage for logic          ; 49717           ;
      ;     -- 7 input functions                    ; 32              ;
      ;     -- 6 input functions                    ; 12400           ;
      ;     -- 5 input functions                    ; 1882            ;
      ;     -- 4 input functions                    ; 5526            ;
      ;     -- <=3 input functions                  ; 29877           ;
      ;                                             ;                 ;
      ; Dedicated logic registers                   ; 122269          ;
      ;                                             ;                 ;
      ; I/O pins                                    ; 0               ;
      ; Total MLAB memory bits                      ; 0               ;
      ; Total block memory bits                     ; 5220584         ;
      ;                                             ;                 ;
      ; Total DSP Blocks                            ; 392             ;
      ;     -- Total Fixed Point DSP Blocks         ; 8               ;
      ;     -- Total Floating Point DSP Blocks      ; 384             ;
      ;                                             ;                 ;
      ; Maximum fan-out node                        ; clock_reset_clk ;
      ; Maximum fan-out                             ; 147565          ;
      ; Total fan-out                               ; 914768          ;
      ; Average fan-out                             ; 4.56            ;
      +---------------------------------------------+-----------------+
  • 37. FFT 1d
      ● AI ~ 7.48: $AI = \frac{5 N \log(N) / \log(2)}{4 \times 2 \times N} = \frac{5}{8} \cdot \frac{\log(N)}{\log(2)}$ … but ..
      ● FPGA ~ 120 Gflop/s
      Fixed 4k points transform
      Launching FFT transform for 2000 iterations
      FFT kernel initialization is complete.
      Processing time = 4.0878ms
      Throughput = 2.0040 Gpoints / sec (120.2420 Gflops)
      Signal to noise ratio on output sample: 137.677661 --> PASSED
      Launching inverse FFT transform for 2000 iterations
      Inverse FFT kernel initialization is complete.
      Processing time = 4.0669ms
      Throughput = 2.0143 Gpoints / sec (120.8579 Gflops)
      Signal to noise ratio on output sample: 137.041007 --> PASSED
  • 38. FFT 2d
      ● ~ 66 Gflop/s
      ● $AI = \frac{5 N^2 \log(N^2)}{4 \times 2 \times N^2 \log(2)} = \frac{5 \times 2 \log(N)}{4 \times 2 \log(2)} = \frac{5}{4} \cdot \frac{\log(N)}{\log(2)}$
      Launching FFT transform (alternative data layout)
      Kernel initialization is complete.
      Processing time = 1.5787ms
      Throughput = 0.6642 Gpoints / sec (66.4201 Gflops)
      Signal to noise ratio on output sample: 137.435876 --> PASSED
      Launching inverse FFT transform (alternative data layout)
      Kernel initialization is complete.
      Processing time = 1.5781ms
      Throughput = 0.6644 Gpoints / sec (66.4440 Gflops)
      Signal to noise ratio on output sample: 136.689050 --> PASSED
  • 39. Computing π with Monte Carlo
      Computes π with a Mersenne twister RNG. Points = 2^22, GlobalWS = WG 32, LocalWS = WI 32.
      I. 32 x 32 WI = 1024 work-items, each generating 4096 random numbers in [0,1]x[0,1]: 2^22 = 4194304
      II. For each batch of 4096 random numbers, count the points inside and outside the circle
      III. Compute the average
      It takes ~854 ns for each random number.
      Using AOCX: mt.aocx
      Reprogramming device with handle 1
      Count all 26354932.000000 / ( rn  4096 * rng 32 *32) * 4
      Computed pi  = 3.141753
      Mersenne twister : 1954.146849[ms]
      Computing     pi : 1632.340594[ms]
      Copy results     :    0.077611[ms]
      Total time       : 3586.565054[ms]
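The counting scheme of steps I to III can be sketched on the CPU in a few lines of C. To stay short, this sketch swaps the Mersenne twister for a 32-bit xorshift generator, which is a much simpler generator and is used here only as a stand-in; the function names and the seed handling are likewise assumptions.

```c
#include <stdint.h>

/* Small xorshift generator, used here instead of the Mersenne twister
 * purely to keep the example short. The state must never be 0. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Draw n_points points in [0,1) x [0,1), count those inside the unit
 * quarter circle, and scale the fraction by 4. n_points must be > 0. */
double estimate_pi(uint32_t n_points, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;   /* avoid the all-zero state */
    uint32_t inside = 0;
    for (uint32_t i = 0; i < n_points; ++i) {
        double x = xorshift32(&state) / 4294967296.0;
        double y = xorshift32(&state) / 4294967296.0;
        if (x * x + y * y <= 1.0)
            ++inside;
    }
    return 4.0 * inside / n_points;
}
```

On the FPGA the same count is split over the 32 x 32 work-items, each handling its own batch of random numbers, and the partial counts are averaged at the end.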
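  • 40. Sobel filter
      ● 1920 x 1080 pixel image, 3 x 8-bit color planes, ~ 6 MB
      ● The filter can be applied at 140 fps
      ● Luma (Rec. BT.709): $luma = (([R\ G\ B] \cdot [66\ 129\ 25] + 128) \gg 8) + 16$
      ● Sobel operators: $S_x = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$, $S_y = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$
      ● Convolution: $\frac{\partial I}{\partial x} = I * S_x$, $\frac{\partial I}{\partial y} = I * S_y$
      ● $\nabla I = \left[\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right]$, $\|\nabla I\| = \sqrt{\left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2}$
      ● $\|\nabla I(i,j)\| < \theta \rightarrow pixel(i,j) = (0,0,0)$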
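The per-pixel work of the filter can be sketched in plain C as follows. The luma formula and the two 3x3 operators are taken from the slide; the function names, the row-major 8-bit luma image layout, and the choice to return the squared gradient magnitude (so that no square root is needed, comparing against θ² instead of θ) are assumptions of this sketch. The operators are applied without flipping them, which only changes the sign of the derivatives and not the magnitude.

```c
/* Luma from 8-bit R, G, B using the integer formula on the slide. */
int luma_from_rgb(int r, int g, int b)
{
    return ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16;
}

/* The two Sobel operators, named as on the slide. */
static const int SX[3][3] = { {-1, -2, -1}, { 0, 0, 0}, { 1, 2, 1} };
static const int SY[3][3] = { {-1,  0,  1}, {-2, 0, 2}, {-1, 0, 1} };

/* Squared gradient magnitude at the interior pixel (row, col) of a
 * row-major luma image with `width` pixels per row. The caller must make
 * sure that the 3x3 neighborhood lies inside the image. */
long sobel_mag2(const unsigned char *luma, int width, int row, int col)
{
    long gx = 0, gy = 0;
    for (int dr = -1; dr <= 1; ++dr)
        for (int dc = -1; dc <= 1; ++dc) {
            int v = luma[(row + dr) * width + (col + dc)];
            gx += SX[dr + 1][dc + 1] * v;
            gy += SY[dr + 1][dc + 1] * v;
        }
    return gx * gx + gy * gy;
}
```

A pixel would then be set to (0,0,0) whenever `sobel_mag2(...)` is smaller than the square of the threshold θ.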
  • 41. Other implementations
      ● Smith-Waterman
        – Algorithm for computing the best match (with gaps and mismatches) between 2 DNA sequences. Status: in progress
      ● Spiking Neurons
        – The McCulloch-Pitts neuron (and later the Rosenblatt perceptron) are too simple as models of neuron communication. In fact neurons certainly use spike frequency to signal the strength of activation, or maybe even use spikes as a kind of binary code between them. Status: thought about it
  • 42. IX. More on OpenCL for FPGA
  • 43. OpenCL https://www.khronos.org/files/opencl-1-1-quick-reference-card.pdf
      Originally authored by Apple, tired of the need to support all the newly arriving computing devices (NVIDIA, AMD, Intel, ..). (2007/2008)
      It mostly follows the lines of its predecessor NVIDIA CUDA, but uses a different terminology.
      The rights were passed to a consortium that develops standards: Khronos. This consortium also develops the OpenGL standard. (2008/2009)
  • 44. OpenCL platform model: 1 host + 1 or more compute devices [Diagram: Host, Compute device, Compute unit, PE (Processing Element)]
  • 45. OpenCL platform model and FPGAs
      FPGA:
      ● A Compute Device is an FPGA card (there can be many in a PC)
      ● A Compute Unit is a pipeline instantiated by the FPGA OpenCL compiler (you can implement multiple pipelines on the FPGA, as you will see in a later slide)
      ● A Processing Element (PE) is e.g. a DSP adder or multiplier in a pipeline
      NVIDIA CUDA:
      ● A Compute Device is an NVIDIA CUDA card
      ● A Compute Unit is a Streaming Multiprocessor (SM)
      ● A Processing Element (PE) is a CUDA core (on NVIDIA all cores in a warp execute the same instruction)
  • 46. OpenCL / CUDA Data Parallel Model
      The problem is represented as a computation carried out over a 1-, 2- or 3-dimensional array.
      OpenCL: ● NDRange ● WorkGroup ● WorkItem
      CUDA: ● Grid ● ThreadBlock ● Thread
  • 47. OpenCL NDRange, work-group, work-item, corresponding to CUDA grid, CUDA threadblock, CUDA thread [Figure from Intel, https://software.intel.com/sites/landingpage/opencl/optimization-guide/Basic_Concepts.htm]
  • 48. OpenCL attributes for FPGA
      #define NUM_SIMD_WORK_ITEMS  4
      #define REQD_WORK_GROUP_SIZE (64,1,1)
      #define NUM_COMPUTE_UNITS  2
      #define MAX_WORK_GROUP_SIZE 512
      __kernel
      __attribute__((max_work_group_size( MAX_WORK_GROUP_SIZE )))
      __attribute__((reqd_work_group_size REQD_WORK_GROUP_SIZE ))
      __attribute__((num_compute_units( NUM_COMPUTE_UNITS )))
      __attribute__((num_simd_work_items( NUM_SIMD_WORK_ITEMS )))
      void function(..) { ...; }
      But .. the compiler is mostly resource driven and often it doesn't obey your will, despite what the docs promise.
  • 49. Vector Addition/Matrix Multiplication OpenCL kernels
      // vector addition
      C:
        for (i = 0; i < N; i++) {
            C[i] = A[i] + B[i];
        }
      OpenCL:
        __kernel void vecadd(__global const float* A,
                             __global const float* B,
                             __global float* C)
        {
            int i = get_global_id(0);   // one work-item per element
            C[i] = A[i] + B[i];
        }
      // matrix multiplication
      C:
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                Temp = 0.0f;
                for (k = 0; k < N; k++) {
                    Temp += A[i][k] * B[k][j];
                }
                C[i][j] = Temp;
            }
        }
      OpenCL:
        // N is assumed to be a compile-time constant (e.g. a #define);
        // the matrices are passed as flat row-major arrays.
        __kernel void matmul(__global const float* A,
                             __global const float* B,
                             __global float* C)
        {
            int i = get_global_id(0);
            int j = get_global_id(1);
            float sum = 0.0f;           // private to each work-item
            for (int k = 0; k < N; k++) {
                sum += A[i*N + k] * B[k*N + j];
            }
            C[i*N + j] = sum;
        }
  • 50. X. Getting the most: we need to look at the architecture!
  • 51. Arria 10 DSP in Floating Point mode [Figure]
  • 52. Arria 10 killah kernel
      [Diagram: x[i] → (* 2.0, + 2.0) → (* 0.5, + -1.0) → res]
      #define N (256*1024*1024)
      __kernel void loop(__global const float* x,
                         __global float *restrict y)
      {
          int gid = get_global_id(0);   // work-item index, kept separate from the loop counter
          float res = x[gid];
          #pragma unroll 700
          for (int i = 0; i < N; i++) {
              res = res*2.0f + 2.0f;
              res = res*0.5f - 1.0f;
          }
          y[gid] = res;
      }
      Initializing OpenCL
      Platform: Altera SDK for OpenCL
      Using 1 device(s)
        p385a_sch_ax115 : nalla_pcie (aclnalla_pcie0)
      Using AOCX: loop.aocx
      Reprogramming device with handle 1
      Launching for device 0 (100000 elements)
      Total runs 100000 , gflop 107374.182400
      Wall Time: 139909.012 ms Gflop/s 767.457225
      Kernel time (device 0): 139908.517 ms Gflop/s 767.459945
      Total work: 100,000 x 4 x (256*1024*1024) flops
  • 53. Arria 10 killah kernel – Quartus report
      +---------------------------------------------------------------+
      ; Spectra-Q Synthesis Resource Usage Summary for Partition "|"  ;
      +---------------------------------------------+-----------------+
      ; Resource                                    ; Usage           ;
      +---------------------------------------------+-----------------+
      ; Estimate of Logic utilization (ALMs needed) ; 82174           ;
      ;                                             ;                 ;
      ; Combinational ALUT usage for logic          ; 102803          ;
      ;     -- 7 input functions                    ; 5               ;
      ;     -- 6 input functions                    ; 1842            ;
      ;     -- 5 input functions                    ; 11104           ;
      ;     -- 4 input functions                    ; 18594           ;
      ;     -- <=3 input functions                  ; 71258           ;
      ;                                             ;                 ;
      ; Dedicated logic registers                   ; 151334          ;
      ;                                             ;                 ;
      ; I/O pins                                    ; 0               ;
      ; Total MLAB memory bits                      ; 0               ;
      ; Total block memory bits                     ; 1348604         ;
      ;                                             ;                 ;
      ; Total DSP Blocks                            ; 1400            ;
      ;     -- Total Fixed Point DSP Blocks         ; 0               ;
      ;     -- Total Floating Point DSP Blocks      ; 1400            ;
      ;                                             ;                 ;
      ; Maximum fan-out node                        ; clock_reset_clk ;
      ; Maximum fan-out                             ; 155035          ;
      ; Total fan-out                               ; 692846          ;
      ; Average fan-out                             ; 2.65            ;
      +---------------------------------------------+-----------------+
      This is extremely good. It shows that the OpenCL compiler really created the same design an experienced hardware engineer could have created using Verilog. It used one DSP in multiply-add mode for each line of the loop body, that is 2 DSPs per unrolled iteration (700 × 2 = 1400). And it reached 50/60 % of the peak performance.
  • 54. XI. Programming FPGAs with Schematics. Only very small projects can be handled using schematics.
  • 55. Quartus – FPGA using Schematics
      1. Create a new project with the wizard (give a directory and a project name) and select an empty project
      2. Choose the FPGA model: 10AX115N3F40E2SG
      3. Open New File and choose Design->Schematic: a design whiteboard opens up
      4. Choose all the components you need: I/O pins, DSP blocks (choose them in integer or floating point mode from the IP catalog; a parameter editor will open up and you can program them to be adders/multipliers/FMA)
      5. Connect the components with buses from the top menu
  • 56. Quartus – Schematic for scalar product [Figure]
  • 57. Quartus report using Schematics [Figure]
  • 58. XII. Programming FPGAs with an HDL: Verilog. Again the scalar product of 2 vectors of length 4. Large projects can't be managed using schematics: hundreds/thousands/tens of thousands of components, millions of interconnections, ...
  • 59. top.v
      module top(x0,y0,x1,y1,x2,y2,x3,y3,z,clk,ena,aclr);
          input [31:0]x0; input [31:0]y0;
          input [31:0]x1; input [31:0]y1;
          input [31:0]x2; input [31:0]y2;
          input [31:0]x3; input [31:0]y3;
          output [31:0]z;
          input clk; input ena; input [1:0]aclr;
          wire [31:0]ir0; wire [31:0]ir1; wire [31:0]ir2;
          wire [31:0]ir3; wire [31:0]ir4; wire [31:0]ir5;
          dsp_fp_mul m1(.aclr(aclr),.ay(x0),.az(y0),.clk(clk),.ena(ena),.result(ir0));
          dsp_fp_mul m2(.aclr(aclr),.ay(x1),.az(y1),.clk(clk),.ena(ena),.result(ir1));
          dsp_fp_mul m3(.aclr(aclr),.ay(x2),.az(y2),.clk(clk),.ena(ena),.result(ir2));
          dsp_fp_mul m4(.aclr(aclr),.ay(x3),.az(y3),.clk(clk),.ena(ena),.result(ir3));
          dsp_fp_add a1(.aclr(aclr),.ax(ir0),.ay(ir1),.clk(clk),.ena(ena),.result(ir4));
          dsp_fp_add a2(.aclr(aclr),.ax(ir2),.ay(ir3),.clk(clk),.ena(ena),.result(ir5));
          dsp_fp_add a3(.aclr(aclr),.ax(ir4),.ay(ir5),.clk(clk),.ena(ena),.result(z));
      endmodule
      [Diagram: top.v instantiates dsp_fp_mul.v as m1, m2, m3, m4 and dsp_fp_add.v as a1, a2, a3]
      In Verilog what looks like a function call is in fact an instantiation of a circuit inside another. The parameter syntax represents the correspondence (connection) of wires with wires.
  • 60. dsp_fp_xxx
      // dsp_fp_mul.v
      // Generated using ACDS version 16.0 211
      `timescale 1 ps / 1 ps
      module dsp_fp_mul (
          input  wire [1:0]  aclr,   //   aclr.aclr
          input  wire [31:0] ay,     //     ay.ay
          input  wire [31:0] az,     //     az.az
          input  wire        clk,    //    clk.clk
          input  wire        ena,    //    ena.ena
          output wire [31:0] result  // result.result
      );
          dsp_fp_mul_altera_fpdsp_block_160_ebvuera fpdsp_block_0 (
              .clk    (clk),    //    clk.clk
              .ena    (ena),    //    ena.ena
              .aclr   (aclr),   //   aclr.aclr
              .result (result), // result.result
              .ay     (ay),     //     ay.ay
              .az     (az)      //     az.az
          );
      endmodule

      // dsp_fp_add.v
      // Port names follow the connections used in top.v (.ax, .ay, .result)
      `timescale 1 ps / 1 ps
      module dsp_fp_add (ax,ay,result,clk,ena,aclr);
          input wire [31:0]ax;
          input wire [31:0]ay;
          output wire [31:0]result;
          input wire clk;
          input wire ena;
          input wire [1:0]aclr;
          dsp_fp_add_altera_fpdsp_bloc_160_nmfrqti fdsp_block_0 (
              .clk (clk), .ena(ena), .aclr(aclr),
              .ax     (ax),     //     ax.ax
              .ay     (ay),     //     ay.ay
              .result (result)  // result.result
          );
      endmodule
      These 2 modules are generated automatically when you instantiate a DSP in floating point mode from the IP cores and configure it as an adder or a multiplier.
  • 61. Quartus report on Scalar Product using HDL: exactly the same as for the project using schematics.
  • 62. SystemVerilog (1) – Killah kernel, sp_12.sv
      module sp_12 #( parameter N=700) (
          input logic [31:0]x,
          output logic [31:0]out,
          input logic clk,ena,
          input logic [1:0]aclr
      );
          logic [31:0] mul_2,add_2, mul_05,sub_1;
          logic [31:0]ir[2*N+4];
          assign mul_2 = shortreal'(2.0);
          assign add_2 = shortreal'(2.0);
          assign mul_05 = shortreal'(0.5);
          assign sub_1 = shortreal'(-1.0);
          assign ir[0] = x;
          genvar i;
          generate
              // i runs from 0 to N inclusive: N+1 stages of 2 DSPs each
              for(i=0;i<=N;i=i+1) begin: FMA2_LOOP
                  // first step of the loop body: res*2.0 + 2.0
                  dsp_fp_fma inst (
                      .ax(add_2), .ay(ir[2*i]), .az(mul_2), .result(ir[2*i+1]),
                      .clk(clk), .ena(ena), .aclr(aclr) );
                  // second step of the loop body: res*0.5 - 1.0
                  dsp_fp_fma inst1(
                      .ax(sub_1), .ay(ir[2*i+1]), .az(mul_05), .result(ir[2*i+2]),
                      .clk(clk), .ena(ena), .aclr(aclr) );
              end
          endgenerate
          assign out = ir[2*N+2];
      endmodule
      Quartus report: 1,402 DSPs used.
      (1) SystemVerilog is a new edition of Verilog (IEEE 1800-2012) with many additions.
  • 63. Verilog: dsp_fp_fma.v
      // dsp_fp_fma.v
      // Generated using ACDS version 16.0 211
      `timescale 1 ps / 1 ps
      module dsp_fp_fma (
          input  wire [1:0]  aclr,   //   aclr.aclr
          input  wire [31:0] ax,     //     ax.ax
          input  wire [31:0] ay,     //     ay.ay
          input  wire [31:0] az,     //     az.az
          input  wire        clk,    //    clk.clk
          input  wire        ena,    //    ena.ena
          output wire [31:0] result  // result.result
      );
          dsp_fp_fma_altera_fpdsp_block_160_fj4u2my fpdsp_block_0 (
              .clk    (clk),    //    clk.clk
              .ena    (ena),    //    ena.ena
              .aclr   (aclr),   //   aclr.aclr
              .result (result), // result.result
              .ax     (ax),     //     ax.ax
              .ay     (ay),     //     ay.ay
              .az     (az)      //     az.az
          );
      endmodule
    This file is generated automatically when you instantiate a DSP as a multiplier/adder with the parameter editor. Unlike the files generated for a single operation (only mul or only add), it uses all 3 input buses, as you can see.
  • 64. XIII. Spatial Computing (OpenSPL)
  • 65. OpenSPL: Open Spatial Programming Language
    ● A buzzword in the hands of a consortium led by Maxeler and Juniper on the industrial side, and Stanford University, Imperial College, Tokyo University, .. on the academic side
    ● Everything is kept as a trade secret for now
    ● Java interface ..
    ● IMHO this is a missed opportunity:
      – "Spatial Programming" is probably the wrong name at a time when thousands of things around GPS, GEO, etc. already go by it
      – Plans and standards should be open, not kept secret from everyone except consortium members
      – The industrial members are weak in this market
      – Java is, IMHO, not the right tool here
      – An open source movement should be started instead
  • 66. My Proposal: json-graph-fpga
    Use a simple, already existing format to describe the graph of components: JSON for instance, or json-graph. (We assume all components are connected to a global clock.)
      {
        "inputs": ["x0","x1","x2","x3","y0","y1","y2","y3"],
        "x0": ["m1"], "y0": ["m1"],
        "x1": ["m2"], "y1": ["m2"],
        "x2": ["m3"], "y2": ["m3"],
        "x3": ["m4"], "y3": ["m4"],
        "m1": ["a1"], "m2": ["a1"],
        "m3": ["a2"], "m4": ["a2"],
        "a1": ["a3"], "a2": ["a3"],
        "a3": ["outputs"]
      }
    [Figure: data flow graph from Inputs through the multipliers m1..m4 (*) and the adders a1, a2, a3 (+) to Outputs]
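    To show that such a description carries enough information to drive a tool, here is a minimal sketch that loads the graph and evaluates it in software. It is an illustration under stated assumptions, not part of the proposal: the component type is guessed from the first letter of the node name ("m" = multiplier, "a" = adder), the embedded JSON wires y0..y3 to m1..m4 (the reading that matches the scalar-product data flow graph), and the function name evaluate is hypothetical.

```python
import json
import math
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key lists the nodes that its output feeds (successor lists).
GRAPH_JSON = """
{
  "inputs": ["x0", "x1", "x2", "x3", "y0", "y1", "y2", "y3"],
  "x0": ["m1"], "y0": ["m1"],
  "x1": ["m2"], "y1": ["m2"],
  "x2": ["m3"], "y2": ["m3"],
  "x3": ["m4"], "y3": ["m4"],
  "m1": ["a1"], "m2": ["a1"],
  "m3": ["a2"], "m4": ["a2"],
  "a1": ["a3"], "a2": ["a3"],
  "a3": ["outputs"]
}
"""


def evaluate(graph_json, values):
    """Evaluate the component graph for the given input values."""
    graph = json.loads(graph_json)
    inputs = graph.pop("inputs")
    missing = set(inputs) - set(values)
    if missing:
        raise ValueError(f"missing input values: {sorted(missing)}")

    # Invert the successor lists into predecessor lists.
    predecessors = {}
    for source, destinations in graph.items():
        for destination in destinations:
            predecessors.setdefault(destination, []).append(source)

    results = {name: values[name] for name in inputs}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(predecessors).static_order():
        if node in results:  # an input, already known
            continue
        operands = [results[p] for p in predecessors[node]]
        if node == "outputs":
            results[node] = operands
        elif node.startswith("m"):  # assumed: multiplier
            results[node] = math.prod(operands)
        elif node.startswith("a"):  # assumed: adder
            results[node] = sum(operands)
        else:
            raise ValueError(f"unknown component type for node {node!r}")
    return results["outputs"]


x = [1, 2, 3, 4]
y = [5, 6, 7, 8]
values = {f"x{i}": v for i, v in enumerate(x)}
values.update({f"y{i}": v for i, v in enumerate(y)})
print(evaluate(GRAPH_JSON, values))  # [70]
```

    The same traversal order could be used to emit one HDL instance per node instead of computing a value, which is the direction the proposal points at.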
  • 67. XIV. What's next ?
  • 68. Top INTEL/Altera product: Stratix 10
                        Arria 10 (10AX115)    Stratix 10 (GX2800)
      Technology        20 nm                 Intel 14 nm (Tri-Gate) FinFET
      Logic elements    1,150,000             2,753,000
      ALM               472,500               933,120
      DSP               1,518                 5,760
      M20K blocks       2,713                 11,721
      Registers         1,708,800             3,732,480
      Peak Tflops       1.5                   10
    Stratix 10 = more than 6 x the fp performance of Arria 10 (10 vs 1.5 peak Tflops)
  • 69. How to lift off-board bandwidth limitations ?
    Directly to QPI or PCIe
    ● Connect directly to the Intel QPI (QuickPath Interconnect) or the future Intel UPI (Ultra Path Interconnect), the point-to-point processor/chipset interconnect (60-80 GB/s). Already done with Xilinx chips
    ● Stratix 10 supports 4x PCIe Gen3 x16 ~ 60 GB/s
    Stand alone
    ● Use FPGAs stand-alone. The Stratix 10 supports DDR4 memory or HMC (Hybrid Memory Cube). Connections with Interlaken channels support 14.7 Gb/s per lane.
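    The ~ 60 GB/s figure for four PCIe Gen3 x16 links can be checked with a short calculation. This is a sketch based on the standard PCIe Gen3 parameters (8 GT/s per lane, 128b/130b encoding), which are not stated on the slide; it gives the raw link rate in one direction and ignores protocol overhead. The function name is hypothetical.

```python
def pcie_gen3_gbytes_per_s(lanes: int) -> float:
    """Raw one-direction bandwidth of a PCIe Gen3 link, in GB/s."""
    gt_per_s = 8.0        # transfers per second per lane, in GT/s
    encoding = 128 / 130  # 128b/130b line encoding efficiency
    return gt_per_s * encoding * lanes / 8  # divide by 8: bits -> bytes


one_link = pcie_gen3_gbytes_per_s(16)  # a single Gen3 x16 link
four_links = 4 * one_link              # the 4x Gen3 x16 configuration
print(f"{one_link:.2f} GB/s per link, {four_links:.2f} GB/s for four links")
```

    The result is a little under 16 GB/s per link and about 63 GB/s in total, consistent with the rounded value on the slide.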
  • 70. XV. Competitors
  • 71. Competitors
    NVIDIA GPGPU
    ● NVIDIA P100 (next year)
      – 3,584 cores
      – 1,328 MHz at 300 W, 1,126 MHz at 250 W
      – CUDA compute capability 6.0
      – Single precision Gflops 8,000-10,000
      – 3,584 * 1,328 MHz * 2 = 9,519 GFlops (~ 9.5 TFlops)
      – TDP 250-300 Watt
      – ~ 10,000-12,000 USD
    INTEL Xeon Phi
    ● INTEL Xeon Phi 7290
      – 72 cores
      – Freq. 1.50 GHz
      – TDP 245 Watt
      – ~ 4,110 USD
    INTEL Xeon
    ● INTEL Xeon E5-4699v4
      – 22 cores
      – Freq. 2.20 GHz
      – TDP 135 Watt
      – ~ 7,000 USD
    INTEL/Altera FPGAs
    ● INTEL Arria 10 GX
      – 1,518 DSP
      – Freq. 0.5 GHz
      – Peak 1,518 * 2 * 0.5 = 1.5 Tflops
      – TDP ~ 30 Watt
      – ~ 5,000 USD
    ● INTEL Stratix 10
      – 5,760 DSP
      – Freq. 1.0 GHz
      – Peak 10 Tflops
      – TDP ~ 30-40 Watt
      – ~ ??? 20 K USD
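    The peak figures on this slide all come from the same rule of thumb: units x frequency x 2, counting a fused multiply-add as 2 floating-point operations per cycle. A minimal sketch of that arithmetic, using only the unit counts and frequencies listed above (the function name is hypothetical):

```python
def peak_gflops(units: int, freq_ghz: float, flops_per_cycle: int = 2) -> float:
    """Peak GFlops = units * GHz * flops per unit per cycle (2 for an FMA)."""
    return units * freq_ghz * flops_per_cycle


arria10 = peak_gflops(1518, 0.5)    # 1,518 DSP at 0.5 GHz
stratix10 = peak_gflops(5760, 1.0)  # 5,760 DSP at 1.0 GHz
p100 = peak_gflops(3584, 1.328)     # 3,584 cores at 1,328 MHz

for name, value in [("Arria 10", arria10), ("Stratix 10", stratix10), ("P100", p100)]:
    print(f"{name}: {value / 1000:.1f} Tflops")
```

    For the Stratix 10 the rule gives about 11.5 Tflops at a full 1.0 GHz, slightly above the 10 Tflops quoted on the slide.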
  • 72. Competitors/2
    [Bar chart: TDP (Watt x100), peak FP GFlop/s and price for Arria 10, Stratix 10, INTEL E5-2699v4, INTEL Phi 7290 and NVIDIA P100]
  • 73. Competitors/3
    [Bar chart: GFlops/Watt for Arria 10, Stratix 10, INTEL E5-2699v4, INTEL Phi 7290 and NVIDIA P100]
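    The GFlops/Watt comparison can be recomputed from the peak and TDP figures listed on the Competitors slide. The sketch below covers only the three devices for which that slide gives a peak value; where the TDP is a range, both ends are used. The dictionary layout and function name are illustrative choices, not taken from the slides.

```python
# name: (peak GFlops, lowest TDP in Watt, highest TDP in Watt), as listed on the slide
DEVICES = {
    "Arria 10": (1500.0, 30.0, 30.0),
    "Stratix 10": (10000.0, 30.0, 40.0),
    "NVIDIA P100": (9519.0, 250.0, 300.0),
}


def efficiency_range(name: str) -> tuple[float, float]:
    """Return (worst, best) GFlops/Watt for a device, using its TDP range."""
    peak, tdp_low, tdp_high = DEVICES[name]
    return peak / tdp_high, peak / tdp_low


for device in DEVICES:
    worst, best = efficiency_range(device)
    print(f"{device}: {worst:.0f}-{best:.0f} GFlops/Watt")
```

    On these numbers the FPGAs come out ahead per Watt: roughly 50 GFlops/Watt for the Arria 10 against 30-40 for the P100, with the Stratix 10 in the 250-330 range.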
  • 74. XVI. Can I use it ?
  • 75. Can I use it ?
    – I'm interested in running comparison tests against Tesla and other architectures
    – I'm interested in trying kernels with enough Arithmetic Intensity to run efficiently
    – I'm interested in interesting problems :)
    – The limit is that there is only 1 board in 1 PC, and the compiler license is for 1 seat.
    About this, please write to me !
  • 76. "and go on till you come to the end: then stop."
    Lewis Carroll
    (but I think Jacques de La Palice (or de La Palisse) could also have said something like that)
  • 77. END