1. Energy Efficient Coarse-Grain
Reconfigurable Array for Accelerating
Digital Signal Processing
Pasquale Corsonello, Fabio Frustaci, Marco Lanuzza,
Stefania Perri, Paolo Zicari.
Department of Electronics, Computer Science and Systems (DEIS)
University of Calabria, Rende (CS)
2. Outline
Motivation
The proposed Coarse Grain Reconfigurable
Array (CGRA)
Architectural overview
Computational model
Post Layout Results
Comparison
Conclusion
3. The Challenge
Nowadays, Digital Signal Processing (DSP) is extensively used for
several applications
Multimedia
Image analysis and processing
Speech processing
Wireless communication
These applications impose strict hardware requirements
High performance
Real-time operations
High computational load
Intensive arithmetic operations
(add, sub, shift, mult, mult-acc)
Energy-efficiency
Portable devices
Flexibility
Support multiple applications
Match the rapid evolution of the algorithms
4. Executing DSP on various architectures
[Figure: architecture spectrum. General Purpose Processors and Programmable Digital Signal Processors sit at the high-flexibility end, Full Custom solutions at the high-performance end, with Reconfigurable Computing (FPGA, CGRA) in between]
Reconfigurable computing architectures provide an
intermediate tradeoff between flexibility and performance
5. Reconfigurable Computing
FPGAs are very flexible, …
Gate-level functions
General routing
… but the flexibility is very expensive
FPGAs are slower than ASICs, have lower logic density,
and are inefficient for word-level operations
Long reconfiguration times
CGRAs use multiple-bit-wide PEs and more
speed-, area- and power-efficient routing
structures
A compromise between programmability and fixed functionality
Flexible and efficient within an application domain
6. Architectural Overview
[Figure: array overview. An External Memory Interface and a Host Interface feed I/O data and configuration through a Central Controller to a grid of Reconfigurable Cells, each pairing a RAM with a PE, interconnected by programmable latched switches]
Distributed small RAMs and a purpose-designed interconnection
scheme to achieve high performance
Run-time reconfigurable cells to achieve high flexibility within the
target application domain
Distributed control logic to reduce control complexity and enhance
data parallelism
7. The Reconfigurable Cell
[Figure: RC block diagram. Input Stage → RAM Interface → Dual-Port SRAM (256*8-bit) and Configuration Memory → Reconfigurable PE (8-bit) → Output Stage, all driven by an internal Control Unit; external ports AddrA/B_ext, Data_InA/B_ext, Addr_Out_ext, Data_OutA/B_ext]
I/O interface similar to a conventional RAM
2 input/output data ports
2 input address ports
1 output address port
I/O control signals
Dual-Port SRAM (256*8-bit) data memory
Reconfigurable 8-bit PE
Internal Control Unit
Two operative states
Loading
Executing
8. Functionality of the RC in the executing state
[Figure: four RAM/PE configurations of one cell, labelled (a)-(d)]
a) feed-forward mode;
b) feed-back mode;
c) route-through mode;
d) route-through mode (double throughput)
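The four executing-state modes can be illustrated with a small behavioural model. This is a sketch, not the actual RTL: the function names, the PE modelled as a configurable two-input operator, and the 8-bit masking are illustrative assumptions; only the four mode names come from the slide.

```python
# Illustrative behavioural model of a Reconfigurable Cell (RC) in the
# executing state. Only the four mode names follow the slide; all other
# details are assumptions made for illustration.

def rc_execute(mode, ram, pe_op, addr_a, addr_b):
    """Return the cell's output for one step in the given mode."""
    mask = 0xFF  # 8-bit datapath
    a, b = ram[addr_a], ram[addr_b]
    if mode == "feed-forward":
        # RAM feeds the PE; the PE result leaves the cell
        return pe_op(a, b) & mask
    if mode == "feed-back":
        # the PE result is written back into the cell's own RAM
        ram[addr_a] = pe_op(a, b) & mask
        return ram[addr_a]
    if mode == "route-through":
        # PE bypassed: one RAM port is routed straight to the output
        return a
    if mode == "route-through-2x":
        # both RAM ports routed out at once: double throughput
        return (a, b)
    raise ValueError(mode)

ram = list(range(256))
add = lambda x, y: x + y
print(rc_execute("feed-forward", ram, add, 10, 20))    # 30
print(rc_execute("route-through-2x", ram, add, 1, 2))  # (1, 2)
```

The feed-back mode is what lets a single RC iterate on its own data (e.g. an accumulation), while the route-through modes let a cell act purely as a routing resource.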
10. The Control Unit
[Figure: Control Unit block diagram. A Configuration Instruction Counter steps through an Instruction Memory; the Instruction Decoder drives an Addresses Generator (AddrA_int, AddrB_int, Addr_ext), the handshake & elaboration control, and the PE & I/O control signals]
Instructions define the execution of vector/block operations on a large data stream
Each instruction consists of several fields: op_code | #ops | address descriptors
op_code specifies the operation code;
#ops specifies the number of operations to be performed in the current instruction;
address descriptors specify the data organization in the memory.
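A sketch of how such an instruction word might be decoded. The slide only names the three fields; the field widths chosen below (4-bit op_code, 12-bit #ops, 16-bit descriptor) and the packing order are assumptions for illustration, not the actual encoding.

```python
# Hypothetical decoder for the three instruction fields named on the
# slide: op_code, #ops, and address descriptors. The field widths and
# layout are assumptions, not the real instruction format.

def decode(word):
    op_code = (word >> 28) & 0xF      # operation to perform
    n_ops = (word >> 16) & 0xFFF      # operations in this instruction
    descriptor = word & 0xFFFF        # data organization in memory
    return op_code, n_ops, descriptor

# e.g. operation 3, applied 64 times, with descriptor 0x0102
word = (0x3 << 28) | (64 << 16) | 0x0102
print(decode(word))  # (3, 64, 258)
```

Because #ops is part of the instruction, one fetched word drives a whole vector/block operation, which is what keeps the per-cell control logic small.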
12. The Interconnection Topology
[Figure: each cell has N-bit neighbor interconnections to its NW, N, NE, W, E, SW, S, SE neighbors, plus 2N-bit interleaved interconnections, all through programmable latched switches]
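As a rough model, the link set of one cell can be enumerated. The eight neighbor links come from the slide; modelling the "interleaved" interconnections as distance-2 links along the four axes is purely an assumption, since the slide does not specify the interleaving pattern.

```python
# Enumerate the outgoing links of one cell in an R x C array:
# 8 neighbour links (N, NE, E, SE, S, SW, W, NW) plus, as an
# ASSUMPTION about the "interleaved" links, connections to the
# cells two hops away along the four axes.

def links(r, c, rows, cols):
    neigh = [(r + dr, c + dc)
             for dr in (-1, 0, 1) for dc in (-1, 0, 1)
             if (dr, dc) != (0, 0)]
    interleaved = [(r - 2, c), (r + 2, c), (r, c - 2), (r, c + 2)]
    inside = lambda p: 0 <= p[0] < rows and 0 <= p[1] < cols
    return [p for p in neigh + interleaved if inside(p)]

print(len(links(4, 4, 8, 8)))  # 12: all 8 neighbours + 4 interleaved
print(len(links(0, 0, 8, 8)))  # 5: a corner cell has fewer links
```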
13. Applications Mapping: Block-level pipelining
[Figure: timeline of staggered, overlapping Load/Execute phases across consecutive cells RC(i-1), RC(i), RC(i+1)]
The computation is organized in concurrently executing kernels
Each kernel is implemented by an RC
A kernel consumes a set of input data, performs one or more
computations, and produces a set of output data
RCs communicate by sending addressed packets of data
Memory data loading of each cell is overlapped with data production of the
previous cell
An execution is performed as soon as all the necessary input data are
available
The data synchronization mechanism is realized by handshake signals
No explicit temporal scheduling of execution is required
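The load/execute overlap can be sketched as a simple staggered schedule: each cell alternates Load and Execute, offset by one slot from its upstream neighbor, so cell i loads block b in the very slot in which cell i-1 executes (produces) it. This is a toy model with a global slot counter; the real array synchronizes by handshake, with no global schedule.

```python
# Toy schedule for block-level pipelining over a linear chain of RCs.
# Cell i handles block b as: Load in slot 2b+i, Execute in slot 2b+i+1.
# Thus RC(i) loads block b exactly while RC(i-1) executes it, giving
# the staggered Load/Execute pattern of slide 13.

def schedule(n_cells, n_blocks):
    """Return {(cell, slot): ('L'|'E', block)} for a linear pipeline."""
    plan = {}
    for b in range(n_blocks):
        for i in range(n_cells):
            plan[(i, 2 * b + i)] = ("L", b)      # load from previous cell
            plan[(i, 2 * b + i + 1)] = ("E", b)  # execute in the next slot
    return plan

plan = schedule(3, 4)
# RC1 loads block 0 in slot 1, exactly while RC0 executes block 0:
print(plan[(1, 1)], plan[(0, 1)])  # ('L', 0) ('E', 0)
```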
14. Applications Mapping: Flexible computational load balancing
[Figure: two example mappings of four RCs, one as a linear RAM/PE pipeline and one where a stage is split across two data-parallel RCs]
Data parallel
Parallelism in both vertical/temporal and horizontal/spatial directions
Function parallel
Horizontal computational load balancing achieved via data parallelism
Vertical computational load balancing achieved by increasing the number of pipeline stages
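A back-of-the-envelope model of the two balancing directions. The cycle counts and the simple cost model below are illustrative assumptions; the slide gives no numbers, only the two directions.

```python
# Toy throughput model for the two balancing directions of slide 14.
# A kernel needing w operation-cycles per block limits the pipeline to
# one block every w slots. Horizontal balancing maps the kernel onto
# `horizontal` data-parallel RCs; vertical balancing splits it into
# `vertical` pipeline stages, and the slowest stage bounds the block
# rate. All numbers are illustrative.
from math import ceil

def slots_per_block(w, horizontal=1, vertical=1):
    """Cycles between consecutive output blocks for a w-cycle kernel."""
    per_stage = ceil(w / vertical)        # vertical: more pipeline stages
    return ceil(per_stage / horizontal)   # horizontal: data-parallel RCs

print(slots_per_block(64))                # 64: single RC
print(slots_per_block(64, horizontal=2))  # 32: two data-parallel RCs
print(slots_per_block(64, vertical=4))    # 16: four pipeline stages
```

Either direction raises throughput; the choice depends on whether a kernel is easier to split across data (horizontal) or across sub-functions (vertical).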
15. Architecture evaluation
Hardware-assisted simulation environment developed using a XILINX XC4VLX200 device
The implemented system includes 64 RCs organized in 4x4 quadrants
The number of required clock cycles was precisely evaluated for different DSP benchmarks (YCbCr↔RGB, 2D-DCT, 2D-FIR).
Physical evaluation for the ST 90nm CMOS technology
Reconfigurable Cell
Synthesis done with Synopsys Design Compiler
Physical design done with Cadence SoC Encounter, also considering
manufacturing (such as DRCs and antennas) and Signal Integrity (SI)
issues.
Interconnections
Preliminary electrical simulations were performed
Obtained results were compared to a 90nm CMOS Virtex-4 FPGA
16. RC Layout
[Figure: RC layout with Input Stage, RAM Interface, Dual-Port SRAM (256*8-bit), Configuration Memory, PE, Control Unit and Output Stage regions]
Technology: CMOS 90nm
Supply voltage: 1.0 V
Frequency: 1 GHz
Core area: 79.52 um2
Avg. dynamic power @1 GHz: 20 mW
Leakage power: 627.6 uW
17. Resources usage/energy/performance trade-off comparison: proposed array vs. Xilinx Virtex-4
Throughput [MOPS] and energy efficiency [MOPS/W] are measured per 8*8-image block
Color Space Conversion: proposed 13 RCs / 1.034 mm2, 13.3 MOPS, 45.9 MOPS/W; Virtex-4 (CORE Generator) 436 Slices + 2 BRAMs / 1.572 mm2, 1.7 MOPS, 29.1 MOPS/W
2D separable 4x4 FIR: proposed 20 RCs / 1.590 mm2, 10.5 MOPS, 23.9 MOPS/W; Virtex-4 440 Slices + 2 BRAMs / 1.657 mm2, 1.3 MOPS, 18.4 MOPS/W
2D-DCT (8x8): proposed 22 RCs / 1.749 mm2, 10.2 MOPS, 20.8 MOPS/W; Virtex-4 786 Slices + 3 BRAMs / 2.919 mm2, 2.1 MOPS, 14.2 MOPS/W
•Speedups ranging from 4.8X to 8X
•Energy efficiency improvement ranging from 24% to 58%
•Area saving up to 40%.
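The speedup and area-saving headlines can be re-derived directly from the table figures above (a simple check, using only the numbers already quoted):

```python
# Re-derive the summary figures from the comparison table.
# Tuples: (throughput MOPS, efficiency MOPS/W, area mm^2) for the
# proposed array and for Virtex-4, per benchmark.
table = {
    "CSC":    ((13.3, 45.9, 1.034), (1.7, 29.1, 1.572)),
    "2D-FIR": ((10.5, 23.9, 1.590), (1.3, 18.4, 1.657)),
    "2D-DCT": ((10.2, 20.8, 1.749), (2.1, 14.2, 2.919)),
}

speedups = [p[0] / f[0] for p, f in table.values()]
area_savings = [1 - p[2] / f[2] for p, f in table.values()]

print(f"speedup: {min(speedups):.1f}x .. {max(speedups):.1f}x")  # 4.9x .. 8.1x
print(f"max area saving: {max(area_savings):.0%}")               # 40%
```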
18. Conclusion
Presented VLSI implementation of a new coarse-grain
reconfigurable architecture optimized for high throughput
DSP applications
Performance improvement at a low cost
Exploit spatial and temporal parallelism
High arithmetic processing capability
high bandwidth and low latency memory access
Performance/energy/area evaluations for representative
tasks belonging to the target application domain
Obtained results demonstrate significant advantages
with respect to a conventional FPGA
Speedups ranging from 4.8X to 8X
Energy efficiency improvement ranging from 24% to 58%
Area saving up to 40%