5. CPU
[Figure: CPU core block diagram (sources below). Front end: Instruction Cache Tag, uOP Cache Tag, Micro-Fusion/Macro-Fusion, Zeroing Idioms, up to 4 fused uOPs per cycle. Execution ports:
Port0: ALU & Shift, Branch, Divide, 256-bit FMA (Multiply-Add), 256-bit FP Multiply, Vector Integer Multiply, Vector Logicals, Vector Shift
Port1: ALU, LEA (Load Effective Address), Multiply, 256-bit FMA (Multiply-Add), 256-bit FP Add, Vector Integer ALU, Vector Logicals
Port5: ALU, LEA (Load Effective Address), Vector Shuffle, Vector Integer ALU, Vector Logicals
Port6: ALU & Shift, Branch
Port2, Port3: Load Address, Store Address
Port4: Store Data
Port7: Store Address
Other labels: 2x32 Bytes/Cycle Load, 32 Bytes/Cycle Store, 64 Bytes/Cycle, 8-way, 11 Cycle Latency, Data TLB, (42 entries), (56 uOPs)]
http://images.anandtech.com/doci/6985/DT_Haswell_i7_FB_678x452.jpg
http://pc.watch.impress.co.jp/video/pcw/docs/665/735/p10.pdf
6. CPU
Compiler flow: writing software in a programming language
C source code:
#include <stdio.h>
int main(){
  int a = 1 + 2;
  printf("Hello %d\n", a);
  return 0;
}
-> Preprocess -> Compile
Assembly language:
add $t0, $t1, $t2
li $v0, 1
syscall
-> Assemble -> Link
Executable binary: ELF01ABF00F1...
-> Execution on a CPU
8. Instruction fetch (IF)
[Figure: processor block diagram. Blocks: Processor (Program Counter, Control, Register File, ALU) and Main Memory (Instruction, Data). Arrow labels: instruction position, instruction fetch, operation supply, data read/write, compute by consuming data.]
Read an instruction from
main memory
9. Instruction decode (ID)
Decode the instruction
and determine the ALU
operation
10. Register fetch (RF)
Read values from the register file into the ALU
12. Memory access (MA)
When the instruction requires it, move data between
the register file and main memory
13. Write back (WB)
Write back the result to the register file
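The steps on the preceding slides (IF, ID, RF, MA, WB) can be sketched as a toy interpreter. This is a minimal sketch, not a real instruction set: the tuple instruction format, the opcode names (`li`, `add`, `load`, `store`), and the function `run` are assumptions made for illustration. An execute step is placed between RF and MA because the diagram describes the ALU as computing by consuming data.

```python
# Toy model of the processor steps. The instruction format
# (op, r0, r1, r2) and the opcode names are hypothetical.

def run(program, memory, num_regs=4):
    regs = [0] * num_regs              # register file
    pc = 0                             # program counter
    while pc < len(program):
        # IF: read the instruction at the position held by the program counter
        inst = program[pc]
        pc += 1

        # ID: decode the instruction and determine the ALU inputs
        op, r0, r1, r2 = inst
        if op not in ("li", "add", "load", "store"):
            raise ValueError(f"unknown opcode: {op}")

        # RF: read source values from the register file
        if op == "li":
            x, y = r1, 0               # r1 is an immediate value here
        elif op == "add":
            x, y = regs[r1], regs[r2]
        else:                          # load/store: address held in register r1
            x, y = regs[r1], 0

        # Execute: the ALU computes by consuming data
        result = x + y

        # MA: only load and store touch main memory
        if op == "load":
            result = memory[result]
        elif op == "store":
            memory[result] = regs[r0]

        # WB: write the result back to the register file
        if op != "store":
            regs[r0] = result
    return regs


memory = [10, 20, 0]
program = [
    ("li", 0, 0, 0),      # r0 = 0 (address of the first operand)
    ("li", 1, 1, 0),      # r1 = 1 (address of the second operand)
    ("load", 2, 0, 0),    # r2 = memory[r0]
    ("load", 3, 1, 0),    # r3 = memory[r1]
    ("add", 2, 2, 3),     # r2 = r2 + r3
    ("li", 0, 2, 0),      # r0 = 2 (destination address)
    ("store", 2, 0, 0),   # memory[r0] = r2
]
regs = run(program, memory)
print(memory[2])  # 30
```

Each loop iteration performs all steps for one instruction before starting the next, so the sketch models the order of the steps only, not pipelined overlap between instructions.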
25. FPGA in Datacenters
■ Microsoft Bing Search Engine (Catapult)
  ● Higher space density and energy efficiency than GPUs for machine learning (DNN)
Putnam+, A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA'14
http://archive.eetindia.co.in/www.eetindia.co.in/STATIC/ARTICLE_IMAGES/201408/EEIOL_2014AUG14_PL_NT_01_03.jpg
26. FPGA for low cost and energy efficiency
http://www.wired.com/2014/06/microsoft-fpga/
Agile Co-Design for a Reconfigurable Datacenter, FPGA'16
27. Phenox: FPGA-based quadcopter
■ Programmable drone system with FPGA
  ● Zynq: SoC FPGA (ARM CPU + FPGA logic in a single chip)
    ✓ Easy to realize software with dedicated hardware support
Phenox http://phenoxlab.com/
28. CGRA
(Coarse-Grained Reconfigurable Architecture)
  ● EMAX [Tanomoto+, MCSoC'15]
[Figure: EMAX block diagram. Blocks: CPU Core, DRAM, Interconnection, Memory Interface, and the EMAX array drawn as rows of four PEs.]
30. Convolutional Neural Network (CNN)
  ● Convolution
  ● Pooling and Max-out
  ● Full connection
  ● GPU
[Figure: CNN structure. Input Layer -> Hidden Layers (Convolution, Pooling, Max Out, Convolution, Full Connection) -> Output Layer]
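The three layer types named on this slide can be sketched in plain Python. This is a minimal sketch under stated assumptions: single-channel input, "valid" convolution in the cross-correlation form commonly used in CNNs, non-overlapping max pooling, and no activation function. The function names and the example values are made up for illustration and do not come from the slides.

```python
def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation form) of one channel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(fmap[y + i][x + j]
                 for i in range(size) for j in range(size))
             for x in range(0, len(fmap[0]) - size + 1, size)]
            for y in range(0, len(fmap) - size + 1, size)]


def full_connection(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


image = [
    [1, 2, 3, 0, 1],
    [0, 1, 2, 3, 0],
    [1, 0, 1, 2, 3],
    [2, 1, 0, 1, 2],
    [3, 2, 1, 0, 1],
]
kernel = [[1, 0],
          [0, 1]]                       # adds each pixel to its lower-right neighbor

fmap = conv2d(image, kernel)            # 4x4 feature map
pooled = max_pool(fmap)                 # 2x2 after pooling
flat = [v for row in pooled for v in row]
out = full_connection(flat, [[1, 1, 1, 1], [1, -1, 0, 0]], [0, 1])
print(pooled)  # [[4, 6], [4, 4]]
print(out)     # [18, -1]
```

The convolution dominates the work: every output element needs one multiply-add per kernel element, which is the kind of regular, data-parallel loop nest that GPUs and reconfigurable arrays target.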
31. EMAX CNN [Tanomoto+,MCSoC2015]
■ EMAX
  ● IoT
[Figure: bar chart. Axis label: Operations/Byte (0 to 18). Categories: Alexnet-2, CIFAR10-1, CIFAR10-2, CIFAR10-3, CIFAR10 (Avg), Lenet-1, Lenet-2, Lenet (Avg). Series: EMAX, GTX980, GK20A, Core i7, ARM.]
CIFAR-10: 1.41x better than mobile GPU
Lenet: 1.75x better than mobile GPU
32.
■ FPGA
  LSI
  ● Google TPU
https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html
33.
■ RTL (Register Transfer Level)
  ● Timed
34. 2 (c += a * b)
RTL (Verilog HDL): 105 2098 15
☹
35.
■ RTL (Register Transfer Level)
  ● Timed
■ HLS: High Level Synthesis
  ● Untimed
    ✓ (Directive)
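The difference between a timed and an untimed description can be sketched with the multiply-accumulate `c += a * b` from the earlier slide. This is an illustration only: the class `MacTimed`, its two-stage structure, and the `valid` flag are assumptions, not a design taken from the slides. The untimed version states only what is computed; the timed version also states in which clock cycle each register changes.

```python
def mac_untimed(c, a, b):
    """Untimed description: only the computation, no notion of clock cycles."""
    return c + a * b


class MacTimed:
    """Timed (RTL-like) description: registers change once per clock cycle.

    Hypothetical two-stage pipeline: stage 1 registers the product,
    stage 2 adds the registered product to the accumulator.
    """

    def __init__(self):
        self.prod = 0            # pipeline register holding a * b
        self.prod_valid = False  # whether prod holds a real product
        self.c = 0               # accumulator register

    def clock(self, a=0, b=0, valid=False):
        # Compute all next values from the current register values, then
        # update the registers together (like non-blocking assignments).
        next_c = self.c + self.prod if self.prod_valid else self.c
        next_prod, next_valid = a * b, valid
        self.c, self.prod, self.prod_valid = next_c, next_prod, next_valid


pairs = [(1, 2), (3, 4), (5, 6)]

c = 0
for a, b in pairs:
    c = mac_untimed(c, a, b)

hw = MacTimed()
cycles = 0
for a, b in pairs:
    hw.clock(a, b, valid=True)
    cycles += 1
hw.clock()                       # one extra cycle to drain the pipeline
cycles += 1

print(c, hw.c, cycles)  # 44 44 4
```

Both versions produce the same value, but only the timed one fixes the latency (four cycles for three inputs here). In HLS that decision is left to the tool, guided by directives.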
37. HLS
Writing hardware in a programming language
int sum(int array[1024]){
  int ret = 0;
  for(int i=0; i<1024; i++){
    ret += array[i];
  }
  return ret;
}
HLS flow: Lexical Analysis/Tokenize -> Control-Dataflow Analysis -> Scheduling/Allocation -> Code Generation of HDL
module sum(
  input [31:0] array_in,
  input array_in_valid, …
  always @(posedge CLK) begin
    …
    sum <= sum + array_in;
  end
endmodule
EDA flow: Synthesis -> Technology Mapping -> Place and Route -> Bitstream Generation
Bitstream: 1A0C021E...
Configuration of the bitstream to an FPGA
Original HW on an FPGA
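The Scheduling step of the HLS flow can be illustrated with ASAP (as soon as possible) scheduling, one common textbook scheme. This is a minimal sketch, not the algorithm of any particular HLS tool: the function `asap_schedule`, the operation names, and the latencies are assumptions, the dependency graph is assumed to be acyclic, and allocation (limiting the number of multipliers or adders) is not modeled.

```python
def asap_schedule(ops):
    """Assign each operation the earliest possible start cycle.

    ops maps an operation name to (latency_in_cycles, [dependencies]).
    An operation starts when all operations it depends on have finished.
    """
    start = {}

    def visit(name):
        if name not in start:
            _, deps = ops[name]
            start[name] = max((visit(d) + ops[d][0] for d in deps), default=0)
        return start[name]

    for name in ops:
        visit(name)
    return start


# Dataflow graph for y = (a * b) + (c * d) + e, with hypothetical latencies:
# a multiply takes 2 cycles, an add takes 1 cycle.
ops = {
    "mul0": (2, []),
    "mul1": (2, []),
    "add0": (1, ["mul0", "mul1"]),
    "add1": (1, ["add0"]),
}

schedule = asap_schedule(ops)
total_cycles = max(schedule[n] + ops[n][0] for n in ops)
print(schedule)      # {'mul0': 0, 'mul1': 0, 'add0': 2, 'add1': 3}
print(total_cycles)  # 4
```

Here both multiplies start in cycle 0, so this schedule needs two multipliers. A resource-constrained scheduler would trade latency for fewer units.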
38. Xilinx Vivado HLS
■ Free (≠ open-source) compiler for Xilinx FPGAs
  ● Synthesizes Verilog HDL/VHDL from C/C++
  ● Eclipse-based IDE
Xilinx UG902
39. Altera OpenCL
■ OpenCL: parallel programming language for heterogeneous platforms
■ Synthesizes host software and FPGA hardware at the same time, as with GPU programming
http://www.bdti.com/InsideDSP/2013/02/13/Altera
40. OK
■ No.
  ● I/F
■ RTL
  ● Trax RTL
■ RTL
  ● Chisel [Bachrach+, DAC'12]
  ● PyMTL [Lockhart+, MICRO'14]
  ● Synthesijer.Scala [ , IEICE RECONF'15]
41. Veriloggen:
Python RTL
Design generator in Python (runs on the Python interpreter)
from veriloggen import *
m = Module('blinkled')
clk = m.Input('CLK')
led = m.Output('LED', 8)
count = m.Reg('count', 32)
# LED shows the upper 8 bits of the counter
m.Assign( led(count[31:24]) )
# Increment the counter on every rising clock edge
m.Always(Posedge(clk))(
  count( count + 1 ) )
hdl = m.to_verilog()
print(hdl)
[Figure: Python -> Verilog HDL. Veriloggen Object (blinkled: CLK, RST, LED, count, assign, always) -> to_verilog() -> Verilog AST Generator -> Verilog AST (module blinkled, input CLK, input RST) -> Verilog Code Generator -> Verilog Source Code]
Verilog source code
module blinkled (
  input CLK,
  output [7:0] LED
);
  reg [31:0] count;
  assign LED = count[31:24];
  always @(posedge CLK) begin
    count <= count + 1;
  end
endmodule
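The idea behind the figure, building an in-memory description of a module and then walking it to emit Verilog text, can be sketched without Veriloggen. The code below is not Veriloggen's implementation or API: the classes `Port` and `ModuleDef` and the function `to_verilog` are hypothetical stand-ins, and expressions are kept as plain strings rather than expression objects.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    direction: str               # "input" or "output"
    name: str
    width: int = 1


@dataclass
class ModuleDef:
    name: str
    ports: list = field(default_factory=list)
    regs: list = field(default_factory=list)     # (name, width)
    assigns: list = field(default_factory=list)  # (lhs, rhs) as strings
    always: list = field(default_factory=list)   # (clock, [(lhs, rhs), ...])


def width_str(width):
    """Return the Verilog range prefix, empty for 1-bit signals."""
    return "" if width == 1 else f"[{width - 1}:0] "


def to_verilog(m):
    """Walk the module description and emit Verilog source text."""
    lines = [f"module {m.name} ("]
    lines.append(",\n".join(
        f"  {p.direction} {width_str(p.width)}{p.name}" for p in m.ports))
    lines.append(");")
    for name, width in m.regs:
        lines.append(f"  reg {width_str(width)}{name};")
    for lhs, rhs in m.assigns:
        lines.append(f"  assign {lhs} = {rhs};")
    for clock, stmts in m.always:
        lines.append(f"  always @(posedge {clock}) begin")
        for lhs, rhs in stmts:
            lines.append(f"    {lhs} <= {rhs};")
        lines.append("  end")
    lines.append("endmodule")
    return "\n".join(lines)


blinkled = ModuleDef(
    name="blinkled",
    ports=[Port("input", "CLK"), Port("output", "LED", 8)],
    regs=[("count", 32)],
    assigns=[("LED", "count[31:24]")],
    always=[("CLK", [("count", "count + 1")])],
)
hdl = to_verilog(blinkled)
print(hdl)
```

Because the hardware description is an ordinary Python object, it can be built with loops, functions, and parameters before any Verilog text is generated, which is the main appeal of this style of design generator.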
44. Veriloggen is available!
■ GitHub
  ● Veriloggen: https://github.com/PyHDI/veriloggen
  ● PyCoRAM: https://github.com/PyHDI/PyCoRAM
  ● Pyverilog: https://github.com/PyHDI/Pyverilog
$ git clone https://github.com/PyHDI/veriloggen.git
$ git clone https://github.com/PyHDI/Pyverilog.git
$ git clone https://github.com/PyHDI/PyCoRAM.git
■ pip (Python package installer)
$ pip install veriloggen
$ pip install pyverilog
$ pip install pycoram
45.
    ✓ C, C++, C#, Java, Python, Ruby, Perl, JavaScript, Scala, Go, Haskell
  ● →
    ✓ RTL: Verilog HDL, VHDL
    ✓ HDL: Chisel (Scala DSL), PyMTL (Python DSL), Veriloggen
    ✓ HLS: C, C++, OpenCL, Java (Synthesijer), Python (PyCoRAM)
    ✓ ≠ C C
    ✓ Ruby Go Python