1. An FPGA-based Scalable Simulation Accelerator for Tile Architectures
HEART 2011 @ Imperial College London, 14:30–15:00, June 2, 2011
Shinya Takamaeda-Yamazaki†‡, Ryosuke Sasakawa†, Yoshito Sakaguchi†, Kenji Kise†
†Tokyo Institute of Technology, Japan
‡JSPS Research Fellow
2. This presentation introduces the ScalableCore system
n A multi-FPGA system for tile-architecture simulation
l Achieves SCALABLE simulation speed
[Figure: system overview; target core, system functions]
3. Agenda
n Background & Motivation
n Proposal: ScalableCore
n System Implementation
l Overall system
l Components: ScalableCore Unit & Board
l Logic Hierarchy & Architecture
n Evaluation
l Simulation Speed
l Power
n Conclusion
4. Background: Multicores to Many-cores
Intel Single Chip Cloud Computer
48 cores (x86)
TILERA TILE-Gx100
100 cores (MIPS)
5. Simulation Target Manycore: M-Core
n Tile architecture with 2D mesh network
l A Node has: Core, Local Memory, INCC (DMA controller) and Router
l Local Memory: Independent Address Space, Data transfer by DMAs
[Figure: M-Core layout; a 2D mesh of Nodes (Core, Local Memory, INCC, Router "R") with DRAM Controllers on the chip edges]
6. How to evaluate the architectures?
n Customizability vs. Simulation Speed
l We want to run a large benchmark fast
[Figure: evaluation approaches and their trade-offs]
• Real chip: reality, but expensive
• Software simulator: easy construction of an ideal system without HW limitations; customizable
• FPGA simulator: faster simulation, but difficult to construct
7. Poor scalability of simulation speed on software simulators
n Speed decreases as the number of target cores increases
l SimMc: an M-Core software simulator
l Difficult to achieve scalable speed
• Overhead of cycle-accurate simulation
[Figure: simulation speed on SimMc (M-Core simulator); speed degrades by more than the increase in core count]
# Target Cores:           16   32   48   64
Speed [K cycle/sec]:     343  149   96   70
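The slide's claim that SimMc degrades by more than the core count can be checked numerically: under ideal 1/N scaling, the 16-core measurement would predict the speeds at larger core counts, and every measured value falls below that prediction. A minimal sketch (variable names are mine, the numbers are the slide's):

```python
# Measured SimMc simulation speeds from the slide [K target-cycles/sec].
measured = {16: 343, 32: 149, 48: 96, 64: 70}

# If the slowdown were exactly proportional to the core count (1/N scaling),
# the 16-core measurement would predict these speeds at larger sizes:
for n, speed in measured.items():
    predicted = 343 * 16 / n
    print(f"{n} cores: measured {speed}, 1/N prediction {predicted:.0f}")
```

At 64 cores the 1/N prediction is about 86 K cycle/sec, but only 70 is measured, which is exactly the "speed degradation more than the increasing # cores" shown in the chart.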
8. Motivation
n Achieve SCALABLE simulation speed
l = Keep the simulation speed constant even for a large number of cores
n How can the simulation speed be made to scale?
l Our target architecture: M-Core
• Tile architecture with 2D mesh network
Partition the target processor across multiple FPGAs
[Figure: many-core processor partitioned across FPGAs]
9. Proposal of ScalableCore
n Multiple FPGAs mapped onto the target processor
l Each ScalableCore Unit holds a part of the target processor
and shares its simulation progress with the neighboring Units
[Figure: the target processor (M-Core) mapped onto the system]
• ScalableCore Unit (FPGA card with off-chip memory): holds a part of the target processor
• ScalableCore Board: connects the ScalableCore Units
• LCD display: shows simulation information
10. Simulation Target Manycore: M-Core
n Tile architecture with 2D mesh network
l A Node has: Core, Local Memory, INCC (DMA controller) and Router
l Local Memory: Independent Address Space, Data transfer by DMAs
[Figure: M-Core layout as on slide 5; the Node (Core, Local Memory, INCC, Router) is the current target of the ScalableCore system]
11. ScalableCore system 1.1: Overview
n Simulates M-Core with up to 64 Nodes (one FPGA per Node)
[Figure: each Unit simulates one Node (Core, Local Memory, INCC, Router) plus the System Functions; the number of Nodes can be increased or decreased]
15. 64 Nodes (8×8) : 64 ScalableCore Units
Scalable Extension!
16. ScalableCore system 1.1: Components
n ScalableCore Unit
FPGA board with off-chip SRAM
l Xilinx Spartan-3E XC3S500E
l 512 KB SRAM (8-bit, 1 port for read/write)
l Configuration ROM
n ScalableCore Board
Interface board bridging Units
l Power regulator & SD card slot
17. ScalableCore system 1.1: Logic Hierarchy
n Target Core (a Node in M-Core): Core, INCC, Router, Local Memory (interface)
n System Functions: Interface Register, Arbiter, Memory Multiplexer, Ser/Des, Device Controller, Initializer
18. ScalableCore system 1.1: Logic Architecture
[Figure: block diagram of a ScalableCore Unit (Spartan-3E FPGA)]
• Core: Fetch Unit, Decoder, Register File, Execution Unit, State Machine Controller
• INCC: Memory Access Unit, DMA Register, DMA Generator/Receiver
• Router: XBAR with interface registers (IR) to/from the adjacent Units
• Memory: Node Memory, Memory Controller, Memory Multiplexer, off-chip SRAM via the SRAM Controller
• Devices: SD Card Controller, Configuration ROM (XCF04S), JTAG port
• Synchronization: clock/reset and IR Ser/Des links to the adjacent Units, Arbiter
19. Two key techniques
n Local Barrier Synchronization
l Each FPGA simulates one Node of M-Core (or another tile architecture)
l To preserve cycle accuracy, the Units must handshake on their simulation state
• All-to-all handshaking: overhead grows with the number of cores
l Our target is a tile architecture, so…
Handshake with only the 4 neighbors
n Virtual Cycle
l How to emulate complex hardware?
• e.g. a larger number of memory ports
Use multiple FPGA cycles for 1 target cycle
20. Local Barrier Synchronization
n Handshakes with the 4 neighboring FPGAs
l Constant handshaking overhead that does not grow with the number of target cores
l Hence the simulation speed scales
[Figure: in each cycle, Unit 4 sends to and receives from each of its 4 neighbors (Units 0–3); the same pattern repeats in Cycle 1, Cycle 2, …]
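The scalability argument above can be sketched in a few lines: on a 2D mesh, a unit's per-cycle synchronization cost is bounded by its 4 neighbors no matter how large the grid gets, while an all-to-all barrier grows with the unit count. This is an illustrative model, not the deck's RTL; the function names are mine.

```python
# Sketch: per-unit synchronization cost of a local (4-neighbour) barrier
# on a w x h grid of simulation units, vs. an all-to-all handshake.

def neighbours(x, y, w, h):
    """Mesh neighbours of unit (x, y) on a w x h grid (no wrap-around)."""
    cand = [(x, y - 1), (x + 1, y), (x, y + 1), (x - 1, y)]  # N, E, S, W
    return [(nx, ny) for nx, ny in cand if 0 <= nx < w and 0 <= ny < h]

def handshakes_per_unit(w, h, local=True):
    """Max number of handshakes any single unit performs per target cycle."""
    if local:
        return max(len(neighbours(x, y, w, h)) for x in range(w) for y in range(h))
    return w * h - 1  # all-to-all: talk to every other unit

for side in (4, 8, 16):
    n = side * side
    print(n, handshakes_per_unit(side, side, local=True),
          handshakes_per_unit(side, side, local=False))
```

The local-barrier column stays at 4 as the grid grows from 16 to 256 units, while the all-to-all column grows as N−1, which is why the local scheme keeps the simulation speed constant.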
21. Virtual Cycle
n Multiple FPGA clock cycles emulate 1 target clock cycle
l Complex hardware is emulated virtually with simple FPGA resources
• e.g. a multiport RAM realized by driving a 1-port RAM multiple times
[Figure: timeline of one Virtual Cycle]
1. Drive the target circuit (Core, INCC, Router) and advance its state; memory accesses from Core (IF), Core (L/S), INCC Send, and INCC Recv are interleaved through the Memory Multiplexer
2. Data Sender: send the synchronized data over the serial I/Os (North, East, West, South)
3. Data Receiver: receive the synchronized data over the serial I/Os (North, East, West, South)
4. Finish synchronization and proceed to Virtual Cycle N+1
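The multiport-RAM example above can be modeled as follows: one FPGA cycle is spent per logical requester, so a single-port RAM serves four "ports" in one target cycle. This is a behavioral sketch of the idea under my own naming, not the deck's hardware design.

```python
# Sketch: a "virtual cycle" serving several logical memory ports with a
# single-port RAM by spending one FPGA cycle per requester.

class SinglePortRAM:
    """One read-or-write access per FPGA clock cycle."""
    def __init__(self, size):
        self.mem = [0] * size
        self.cycles = 0  # FPGA clock cycles consumed

    def access(self, op, addr, data=None):
        self.cycles += 1
        if op == "write":
            self.mem[addr] = data
        return self.mem[addr]

def virtual_cycle(ram, requests):
    """Serve all requesters of one target cycle sequentially (interleaved)."""
    return [ram.access(op, addr, data) for op, addr, data in requests]

ram = SinglePortRAM(256)
# One target cycle's requesters: instruction fetch, load/store, INCC send, INCC receive.
requests = [("write", 0, 42), ("read", 0, None),
            ("read", 1, None), ("write", 1, 7)]
virtual_cycle(ram, requests)
print(ram.cycles)  # 4 FPGA cycles emulated 1 target cycle
```

The target circuit behaves as if it had a 4-port memory, at the cost of stretching one target cycle over several FPGA cycles; this trade is what keeps each Unit small enough for a Spartan-3E.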
23. Evaluation: Simulation Speed [K cycle/sec]
n = Clock frequency of the target processor [KHz]
l Software simulator: speed degrades as the number of target cores increases
l ScalableCore system: constant speed
n Relative speed
l Grows as the number of cores increases
• 14.2x speedup when simulating 64 Nodes
[Figure: simulation speed and relative speed vs. # Nodes]
# Nodes:                       16    32    48    64
ScalableCore [K cycle/sec]:  1000  1000  1000  1000
Software sim [K cycle/sec]:   343   149    96    70
Relative speed:               2.9x  6.7x 10.4x 14.2x
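The relative-speed column follows directly from the two speed rows: ScalableCore holds a constant 1000 K cycle/sec while SimMc slows down, so the ratio grows with the node count. A quick check from the charted values (the numbers are the slide's):

```python
# Relative speed = ScalableCore speed / software-simulator speed.
scalablecore = 1000  # [K target-cycles/sec], independent of node count
simmc = {16: 343, 32: 149, 48: 96, 64: 70}  # [K target-cycles/sec]

for nodes, sw in simmc.items():
    print(nodes, "nodes:", round(scalablecore / sw, 1), "x")
```

The rounded chart values reproduce 2.9x, 6.7x, and 10.4x exactly; at 64 nodes they give about 14.3x, so the slide's 14.2x presumably comes from the unrounded measurements.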
24. Evaluation: Power [W]
n = Energy consumption of the system per second
l Software simulator: constant power [W]
l ScalableCore system: power increases with the number of Nodes [W]
n Relative efficiency
(= ratio of the energy used to simulate 1 clock cycle of the target)
l Becomes more efficient as the number of target cores increases
• 23.5x efficiency when simulating 64 Nodes
[Figure: power and relative efficiency vs. # Nodes]
# Nodes:                    16    32    48    64
ScalableCore power [W]:     13    26    38    51
Software sim power [W]:     84    84    84    84
Relative efficiency:      19.2  22.2  22.9  23.5
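The relative-efficiency definition above (energy per simulated target cycle) can be recomputed from the charted power and speed values; the variable names here are mine:

```python
# Energy per simulated target cycle = power / simulation speed.
# Relative efficiency = (software J/cycle) / (ScalableCore J/cycle).
speed_sw = {16: 343, 32: 149, 48: 96, 64: 70}  # [K cycle/sec]
speed_sc = 1000                                # [K cycle/sec], constant
power_sw = 84                                  # [W], constant
power_sc = {16: 13, 32: 26, 48: 38, 64: 51}    # [W]

for n in (16, 32, 48, 64):
    joules_sw = power_sw / (speed_sw[n] * 1e3)   # J per target cycle
    joules_sc = power_sc[n] / (speed_sc * 1e3)   # J per target cycle
    print(n, "nodes:", round(joules_sw / joules_sc, 1), "x")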
25. Conclusion
n ScalableCore system 1.1
An FPGA-based scalable simulation system
for tile architecture evaluations
l Multiple FPGAs
l Two key techniques
• Virtual cycle
• Local Barrier Synchronization
l 14.2 times faster simulation than the software simulator
• When simulating a more detailed architecture, the speedup becomes even larger
n Future Work
l Off-chip DRAM support
l Virtually combining multiple FPGAs to simulate a larger core
l Time-multiplexed operation for higher hardware utilization