1. Energy Efficient Coarse-Grain
Reconfigurable Array for Accelerating
Digital Signal Processing
Pasquale Corsonello, Fabio Frustaci, Marco Lanuzza,
Stefania Perri, Paolo Zicari.
Department of Electronics, Computer Science and Systems (DEIS)
University of Calabria, Rende (CS)
2. Outline
Motivation
The proposed Coarse Grain Reconfigurable
Array (CGRA)
Architectural overview
Computational model
Post Layout Results
Comparison
Conclusion
3. The Challenge
Nowadays, Digital Signal Processing (DSP) is extensively used for
several applications
Multimedia
Image analysis and processing
Speech processing
Wireless communication
These applications impose strict hardware requirements
High performance
Real-time operations
High computational load
Intensive arithmetic operations
(add, sub, shift, mult, mult-acc)
Energy-efficiency
Portable devices
Flexibility
Support multiple applications
Match the rapid evolution of the algorithms
4. Executing DSP on various architectures
[Figure: architecture spectrum. General Purpose Processors and Programmable Digital Signal Processors sit at the high-flexibility end, Full Custom solutions at the high-performance end, with Reconfigurable Computing (FPGA, CGRA) in between]
Reconfigurable computing architectures provide an
intermediate tradeoff between flexibility and performance
5. Reconfigurable Computing
FPGAs are very flexible, …
Gate-level functions
General routing
… but the flexibility is very expensive
FPGAs are slower than ASICs, have lower logic density,
and are inefficient for word-level operations
Long reconfiguration times
CGRAs use multiple-bit-wide PEs and more
speed-, area- and power-efficient routing
structures
A compromise between programmability and fixed functionality
Flexible and efficient within an application domain
6. Architectural Overview
[Figure: array overview. An External Memory Interface and a Host Interface feed I/O data and configuration through a Central Controller to a grid of Reconfigurable Cells, each pairing a RAM with a PE, interconnected by programmable latched switches]
Distributed small RAMs and a purpose-designed interconnection
scheme to achieve high performance
Run-time reconfigurable cells to achieve high flexibility within the
target application domain
Distributed control logic to reduce control complexity and enhance
data parallelism
7. The Reconfigurable Cell
[Figure: RC block diagram. Input Stage → RAM Interface → Dual-Port SRAM (256*8-bit) and Configuration Memory → Reconfigurable PE (8-bit) → Output Stage, all driven by an internal Control Unit; external ports AddrA/B_ext, Data_InA/B_ext, Addr_Out_ext, Data_OutA/B_ext]
I/O interface similar to a conventional RAM
2 input/output data ports
2 input address ports
1 output address port
I/O control signals
Dual-Port SRAM (256*8-bit) data memory
Reconfigurable 8-bit PE
Internal Control Unit
Two operative states
Loading
Executing
8. Functionality of the RC in the executing state
[Figure: four RAM/PE configurations of one cell, labelled (a)-(d)]
a) feed-forward mode;
b) feed-back mode;
c) route-through mode;
d) route-through mode (double throughput)
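The four executing-state modes can be illustrated with a small behavioural model. This is a sketch, not the actual RTL: the function names, the PE modelled as a configurable two-input operator, and the 8-bit masking are illustrative assumptions; only the four mode names come from the slide.

```python
# Illustrative behavioural model of a Reconfigurable Cell (RC) in the
# executing state. Only the four mode names follow the slide; all other
# details are assumptions made for illustration.

def rc_execute(mode, ram, pe_op, addr_a, addr_b):
    """Return the cell's output for one step in the given mode."""
    mask = 0xFF  # 8-bit datapath
    a, b = ram[addr_a], ram[addr_b]
    if mode == "feed-forward":
        # RAM feeds the PE; the PE result leaves the cell
        return pe_op(a, b) & mask
    if mode == "feed-back":
        # the PE result is written back into the cell's own RAM
        ram[addr_a] = pe_op(a, b) & mask
        return ram[addr_a]
    if mode == "route-through":
        # PE bypassed: one RAM port is routed straight to the output
        return a
    if mode == "route-through-2x":
        # both RAM ports routed out at once: double throughput
        return (a, b)
    raise ValueError(mode)

ram = list(range(256))
add = lambda x, y: x + y
print(rc_execute("feed-forward", ram, add, 10, 20))    # 30
print(rc_execute("route-through-2x", ram, add, 1, 2))  # (1, 2)
```

The feed-back mode is what lets a single RC iterate on its own data (e.g. an accumulation), while the route-through modes let a cell act purely as a routing resource.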
10. The Control Unit
[Figure: Control Unit block diagram. A Configuration Instruction Counter steps through an Instruction Memory; the Instruction Decoder drives an Addresses Generator (AddrA_int, AddrB_int, Addr_ext), the handshake & elaboration control, and the PE & I/O control signals]
Instructions define the execution of vector/block operations on a large data stream
Each instruction consists of several fields: op_code | #ops | address descriptors
op_code specifies the operation code;
#ops specifies the number of operations to be performed in the current instruction;
address descriptors specify the data organization in the memory.
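A sketch of how such an instruction word might be decoded. The slide only names the three fields; the field widths chosen below (4-bit op_code, 12-bit #ops, 16-bit descriptor) and the packing order are assumptions for illustration, not the actual encoding.

```python
# Hypothetical decoder for the three instruction fields named on the
# slide: op_code, #ops, and address descriptors. The field widths and
# layout are assumptions, not the real instruction format.

def decode(word):
    op_code = (word >> 28) & 0xF      # operation to perform
    n_ops = (word >> 16) & 0xFFF      # operations in this instruction
    descriptor = word & 0xFFFF        # data organization in memory
    return op_code, n_ops, descriptor

# e.g. operation 3, applied 64 times, with descriptor 0x0102
word = (0x3 << 28) | (64 << 16) | 0x0102
print(decode(word))  # (3, 64, 258)
```

Because #ops is part of the instruction, one fetched word drives a whole vector/block operation, which is what keeps the per-cell control logic small.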
12. The Interconnection Topology
[Figure: each cell has N-bit neighbor interconnections to its NW, N, NE, W, E, SW, S, SE neighbors, plus 2N-bit interleaved interconnections, all through programmable latched switches]
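As a rough model, the link set of one cell can be enumerated. The eight neighbor links come from the slide; modelling the "interleaved" interconnections as distance-2 links along the four axes is purely an assumption, since the slide does not specify the interleaving pattern.

```python
# Enumerate the outgoing links of one cell in an R x C array:
# 8 neighbour links (N, NE, E, SE, S, SW, W, NW) plus, as an
# ASSUMPTION about the "interleaved" links, connections to the
# cells two hops away along the four axes.

def links(r, c, rows, cols):
    neigh = [(r + dr, c + dc)
             for dr in (-1, 0, 1) for dc in (-1, 0, 1)
             if (dr, dc) != (0, 0)]
    interleaved = [(r - 2, c), (r + 2, c), (r, c - 2), (r, c + 2)]
    inside = lambda p: 0 <= p[0] < rows and 0 <= p[1] < cols
    return [p for p in neigh + interleaved if inside(p)]

print(len(links(4, 4, 8, 8)))  # 12: all 8 neighbours + 4 interleaved
print(len(links(0, 0, 8, 8)))  # 5: a corner cell has fewer links
```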
13. Applications Mapping: Block-level pipelining
[Figure: timeline of staggered, overlapping Load/Execute phases across consecutive cells RC(i-1), RC(i), RC(i+1)]
The computation is organized in concurrently executing kernels
Each kernel is implemented by an RC
A kernel consumes a set of input data, performs one or more
computations, and produces a set of output data
RCs communicate by sending addressed packets of data
Memory data loading of each cell is overlapped with data production of the
previous cell
An execution is performed as soon as all the necessary input data are
available
The data synchronization mechanism is realized by handshake signals
No explicit temporal scheduling of execution is required
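The load/execute overlap can be sketched as a simple staggered schedule: each cell alternates Load and Execute, offset by one slot from its upstream neighbor, so cell i loads block b in the very slot in which cell i-1 executes (produces) it. This is a toy model with a global slot counter; the real array synchronizes by handshake, with no global schedule.

```python
# Toy schedule for block-level pipelining over a linear chain of RCs.
# Cell i handles block b as: Load in slot 2b+i, Execute in slot 2b+i+1.
# Thus RC(i) loads block b exactly while RC(i-1) executes it, giving
# the staggered Load/Execute pattern of slide 13.

def schedule(n_cells, n_blocks):
    """Return {(cell, slot): ('L'|'E', block)} for a linear pipeline."""
    plan = {}
    for b in range(n_blocks):
        for i in range(n_cells):
            plan[(i, 2 * b + i)] = ("L", b)      # load from previous cell
            plan[(i, 2 * b + i + 1)] = ("E", b)  # execute in the next slot
    return plan

plan = schedule(3, 4)
# RC1 loads block 0 in slot 1, exactly while RC0 executes block 0:
print(plan[(1, 1)], plan[(0, 1)])  # ('L', 0) ('E', 0)
```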
14. Applications Mapping: Flexible computational load balancing
[Figure: two example mappings of four RCs, one as a linear RAM/PE pipeline and one where a stage is split across two data-parallel RCs]
Data parallel
Parallelism in both vertical/temporal and horizontal/spatial directions
Function parallel
Horizontal computational load balancing achieved via data parallelism
Vertical computational load balancing achieved by increasing the number of pipeline stages
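A back-of-the-envelope model of the two balancing directions. The cycle counts and the simple cost model below are illustrative assumptions; the slide gives no numbers, only the two directions.

```python
# Toy throughput model for the two balancing directions of slide 14.
# A kernel needing w operation-cycles per block limits the pipeline to
# one block every w slots. Horizontal balancing maps the kernel onto
# `horizontal` data-parallel RCs; vertical balancing splits it into
# `vertical` pipeline stages, and the slowest stage bounds the block
# rate. All numbers are illustrative.
from math import ceil

def slots_per_block(w, horizontal=1, vertical=1):
    """Cycles between consecutive output blocks for a w-cycle kernel."""
    per_stage = ceil(w / vertical)        # vertical: more pipeline stages
    return ceil(per_stage / horizontal)   # horizontal: data-parallel RCs

print(slots_per_block(64))                # 64: single RC
print(slots_per_block(64, horizontal=2))  # 32: two data-parallel RCs
print(slots_per_block(64, vertical=4))    # 16: four pipeline stages
```

Either direction raises throughput; the choice depends on whether a kernel is easier to split across data (horizontal) or across sub-functions (vertical).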
15. Architecture evaluation
Hardware-assisted simulation environment developed using a XILINX XC4VLX200 device
The implemented system includes 64 RCs organized in 4x4 quadrants
The number of required clock cycles was precisely evaluated for different DSP benchmarks (YCbCr↔RGB, 2D-DCT, 2D-FIR).
Physical evaluation for the ST 90nm CMOS technology
Reconfigurable Cell
Synthesis done with Synopsys Design Compiler
Physical design done with Cadence SoC Encounter, also considering
manufacturing (such as DRCs and antennas) and Signal Integrity (SI)
issues.
Interconnections
Preliminary electrical simulations were performed
Obtained results were compared to a 90nm CMOS Virtex-4 FPGA
16. RC Layout
[Figure: RC layout with Input Stage, RAM Interface, Dual-Port SRAM (256*8-bit), Configuration Memory, PE, Control Unit and Output Stage regions]
Technology: CMOS 90nm
Supply voltage: 1.0 V
Frequency: 1 GHz
Core area: 79.52 um2
Avg. dynamic power @1 GHz: 20 mW
Leakage power: 627.6 uW
17. Resources usage/energy/performance trade-off comparison: proposed array vs. Xilinx Virtex-4
Throughput [MOPS] and energy efficiency [MOPS/W] are measured per 8*8-image block
Color Space Conversion: proposed 13 RCs / 1.034 mm2, 13.3 MOPS, 45.9 MOPS/W; Virtex-4 (CORE Generator) 436 Slices + 2 BRAMs / 1.572 mm2, 1.7 MOPS, 29.1 MOPS/W
2D separable 4x4 FIR: proposed 20 RCs / 1.590 mm2, 10.5 MOPS, 23.9 MOPS/W; Virtex-4 440 Slices + 2 BRAMs / 1.657 mm2, 1.3 MOPS, 18.4 MOPS/W
2D-DCT (8x8): proposed 22 RCs / 1.749 mm2, 10.2 MOPS, 20.8 MOPS/W; Virtex-4 786 Slices + 3 BRAMs / 2.919 mm2, 2.1 MOPS, 14.2 MOPS/W
•Speedups ranging from 4.8X to 8X
•Energy efficiency improvement ranging from 24% to 58%
•Area saving up to 40%.
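The speedup and area-saving headlines can be re-derived directly from the table figures above (a simple check, using only the numbers already quoted):

```python
# Re-derive the summary figures from the comparison table.
# Tuples: (throughput MOPS, efficiency MOPS/W, area mm^2) for the
# proposed array and for Virtex-4, per benchmark.
table = {
    "CSC":    ((13.3, 45.9, 1.034), (1.7, 29.1, 1.572)),
    "2D-FIR": ((10.5, 23.9, 1.590), (1.3, 18.4, 1.657)),
    "2D-DCT": ((10.2, 20.8, 1.749), (2.1, 14.2, 2.919)),
}

speedups = [p[0] / f[0] for p, f in table.values()]
area_savings = [1 - p[2] / f[2] for p, f in table.values()]

print(f"speedup: {min(speedups):.1f}x .. {max(speedups):.1f}x")  # 4.9x .. 8.1x
print(f"max area saving: {max(area_savings):.0%}")               # 40%
```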
18. Conclusion
Presented VLSI implementation of a new coarse-grain
reconfigurable architecture optimized for high throughput
DSP applications
Performance improvement at a low cost
Exploit spatial and temporal parallelism
High arithmetic processing capability
high bandwidth and low latency memory access
Performance/energy/area evaluations for representative
tasks belonging to the target application domain
Obtained results demonstrate significant advantages
with respect to a conventional FPGA
Speedups ranging from 4.8X to 8X
Energy efficiency improvement ranging from 24% to 58%
Area saving up to 40%