The document proposes a novel flexible accelerator architecture comprising flexible computational units (FCUs) that support the execution of various digital signal processing (DSP) operation templates. The FCUs perform computations using carry-save (CS) arithmetic, allowing intermediate results to be reused without conversion to binary. This enables more aggressive CS optimizations than previous approaches. The proposed architecture is analyzed in terms of logic size, area, and power consumption using Xilinx 14.2. Each FCU can be configured to perform addition, subtraction, and multiplication operations in a pipelined fashion to fuse computations and improve performance.
Flexible DSP Accelerator Architecture Exploiting Carry-Save Arithmetic
Abstract:
Hardware acceleration has proved to be an extremely promising implementation strategy for the
digital signal processing (DSP) domain. Rather than adopting a monolithic application-specific
integrated circuit design approach, in this brief, we present a novel accelerator architecture
comprising flexible computational units that support the execution of a large set of operation
templates found in DSP kernels. We differentiate from previous works on flexible accelerators by
enabling computations to be aggressively performed with carry-save (CS) formatted data.
Advanced arithmetic design concepts, i.e., recoding techniques, are utilized, enabling CS
optimizations to be performed in a larger scope than in previous approaches. The proposed
architecture is analyzed in terms of logic size, area, and power consumption using Xilinx 14.2.
Enhancement of the project:
Implement the other operation templates of the FCU.
Existing system:
Modern embedded systems target high-end application domains requiring efficient
implementations of computationally intensive digital signal processing (DSP) functions. The
incorporation of heterogeneity through specialized hardware accelerators improves performance
and reduces energy consumption. Although application-specific integrated circuits (ASICs) form
the ideal acceleration solution in terms of performance and power, their inflexibility leads to
increased silicon complexity, as multiple instantiated ASICs are needed to accelerate various
kernels. Many researchers have proposed the use of domain-specific coarse-grained
reconfigurable accelerators in order to increase ASICs’ flexibility without significantly
compromising their performance.
The aforementioned reconfigurable architectures exclude arithmetic optimizations during the
architectural synthesis and consider them only at the internal circuit structure of
primitive components, e.g., adders, during the logic synthesis. However, research activities have
shown that arithmetic optimizations at higher abstraction levels than the structural circuit one
significantly impact datapath performance. Timing-driven optimizations based on
carry-save (CS) arithmetic have been performed at the post-Register Transfer Level (RTL) design
stage. Common subexpression elimination in CS computations has been used to optimize linear DSP
circuits. Verma et al. developed transformation techniques on the application’s DFG to maximize
the use of CS arithmetic prior to the actual datapath synthesis. The aforementioned CS
optimization approaches target inflexible datapath, i.e., ASIC, implementations. Recently, Xydis
et al. proposed a flexible architecture combining the ILP and pipelining techniques with
CS-aware operation chaining. However, all the aforementioned solutions feature an inherent
limitation, i.e., CS optimization is limited to merging only additions/subtractions. A CS to
binary conversion is inserted before each operation that differs from addition/subtraction,
e.g., multiplication, thus allocating multiple CS to binary conversions that heavily degrade
performance due to time-consuming carry propagations.
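The benefit of staying in CS form can be illustrated with a minimal Python sketch. It is not taken from the source: the names `csa` and `cs_to_bin`, the 16-bit mask, and the operand values are assumptions chosen for illustration. Each 3:2 compression step uses only bitwise logic, so no carry ripples across the word; a single carry-propagating addition is needed at the very end.

```python
MASK = (1 << 16) - 1  # 16-bit word width, assumed for illustration


def csa(a, b, c, mask=MASK):
    """3:2 compressor: reduce three operands to a (carry, sum) pair.

    Only bitwise operations are used, so no carry propagates across bits.
    The pair satisfies carry + sum == a + b + c (modulo 2**16).
    """
    s = (a ^ b ^ c) & mask                             # per-bit sum
    cy = (((a & b) | (a & c) | (b & c)) << 1) & mask   # per-bit majority, weight 2
    return cy, s


def cs_to_bin(cy, s, mask=MASK):
    """Single carry-propagating addition that converts CS form to binary."""
    return (cy + s) & mask


# Accumulate several operands while staying in CS form the whole time.
operands = [1200, 345, 67, 8901]
carry, total = 0, 0
for x in operands:
    carry, total = csa(carry, total, x)

result = cs_to_bin(carry, total)  # the only carry-propagating step
```

Inserting a `cs_to_bin` call after every intermediate step would give the same result, but would place a full carry chain on the critical path each time, which is the cost the paragraph above describes.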
Disadvantages:
High area
High power consumption
Proposed system:
The proposed flexible accelerator architecture is shown in Fig. 1. Each FCU operates directly on
CS operands and produces data in the same form for direct reuse of intermediate results. Each
FCU operates on 16-bit operands. Such a bit-length is adequate for most DSP datapaths, but
the architectural concept of the FCU can be straightforwardly adapted for smaller or larger
bit-lengths. The number of FCUs is determined at design time based on the ILP and area constraints
imposed by the designer. The CStoBin module is a ripple-carry adder and converts the CS form
to the two’s complement one. The register bank consists of scratch registers and is used for storing
intermediate results and sharing operands among the FCUs. Different DSP kernels (i.e., different
register allocation and data communication patterns per kernel) can be mapped onto the proposed
architecture using post-RTL datapath interconnection sharing techniques. The control unit drives
the overall architecture (i.e., communication between the data port and the register bank,
configuration words of the FCUs and selection signals for the multiplexers) in each clock cycle.
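As a rough illustration of what a ripple-carry CStoBin conversion involves, the following Python sketch adds the two words of a CS pair bit by bit. The function names, the default 16-bit width, and the signed reinterpretation helper are assumptions made for this example, not details from the source.

```python
def ripple_carry_add(xc, xs, width=16):
    """Add the carry word and sum word of a CS pair one bit at a time.

    The carry out of bit i feeds bit i + 1, so the delay of the hardware
    equivalent grows linearly with the word width.
    """
    carry = 0
    result = 0
    for i in range(width):
        a = (xc >> i) & 1
        b = (xs >> i) & 1
        result |= (a ^ b ^ carry) << i           # full-adder sum bit
        carry = (a & b) | (carry & (a ^ b))      # full-adder carry out
    return result                                # final carry out is dropped


def to_signed(value, width=16):
    """Reinterpret a width-bit pattern as a two's complement integer."""
    return value - (1 << width) if value & (1 << (width - 1)) else value


# CS pair {XC, XS} holding the bit patterns of -1 and -2; their sum is -3.
binary = to_signed(ripple_carry_add(0xFFFF, 0xFFFE))
```

The serial dependence on `carry` inside the loop is the reason the architecture keeps this conversion out of the FCUs and performs it only when a binary result is actually needed.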
Structure of the Proposed Flexible Computational Unit:
The structure of the FCU (Fig. 2) has been designed to enable high-performance flexible
operation chaining based on a library of operation templates. Each FCU can be configured to
any of the T1–T5 operation templates shown in Fig. 3.
Figure 1: Abstract form of the flexible datapath.
The proposed FCU enables intra-template operation chaining by fusing the additions performed
before/after the multiplication and performs any partial operation template of the following
complex operations:
W∗ = A × (X∗ + Y∗) + K∗ (1)
W∗ = A × K∗ + (X∗ + Y∗). (2)
Figure 2: FCU.
The following relation holds for all CS data: X∗ = {XC, XS} = XC + XS. The operand A is a two’s
complement number. The alternative execution paths in each FCU are specified by properly
setting the control signals of the multiplexers MUX1 and MUX2 (Fig. 2). The
multiplexer MUX0 outputs Y∗ when CL0 = 0 (i.e., X∗ + Y∗ is carried out) or the 1’s complement
of Y∗ when X∗ − Y∗ is required and CL0 = 1. The two’s complement 4:2 CS adder produces
N∗ = X∗ + Y∗ when the input carry equals 0 or N∗ = X∗ − Y∗ when the input carry equals 1. MUX1
determines whether N∗ (1) or K∗ (2) is multiplied by A. MUX2 specifies whether K∗ (1) or N∗ (2) is
added to the multiplication product. The multiplexer MUX3 accepts the output of MUX2 and
its 1’s complement and outputs the former when an addition with the multiplication product
is required (i.e., CL3 = 0) or the latter when a subtraction is carried out (i.e., CL3 = 1). The
1-bit ace for the subtraction (the +1 that completes the two’s complement negation) is added in
the CS adder tree.
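The execution paths selected by the multiplexers can be summarized in a value-level Python sketch. This is a behavioral model under stated assumptions, not the circuit: it ignores bit widths and truncation, returns the integer value of W∗ rather than a CS pair, and the `mode` parameter is a hypothetical stand-in for the MUX1/MUX2 select signals, whose actual encoding the text does not give.

```python
from typing import NamedTuple


class CS(NamedTuple):
    """Carry-save operand X* = {XC, XS}, whose value is XC + XS."""
    c: int
    s: int

    def value(self) -> int:
        return self.c + self.s


def fcu(a: int, x: CS, y: CS, k: CS, cl0: int = 0, mode: int = 1, cl3: int = 0) -> int:
    """Value-level model of one FCU configuration.

    cl0:  0 -> N* = X* + Y*,  1 -> N* = X* - Y*   (MUX0 and the 4:2 CS adder)
    mode: 1 -> W* = A x N* +/- K*   (equation 1)
          2 -> W* = A x K* +/- N*   (equation 2)
          (hypothetical encoding of the MUX1/MUX2 selects)
    cl3:  0 -> add the MUX2 output, 1 -> subtract it (MUX3)
    """
    n = x.value() - y.value() if cl0 else x.value() + y.value()
    if mode == 1:
        multiplicand, addend = n, k.value()   # MUX1 -> N*, MUX2 -> K*
    else:
        multiplicand, addend = k.value(), n   # MUX1 -> K*, MUX2 -> N*
    product = a * multiplicand
    return product - addend if cl3 else product + addend


x, y, k = CS(3, 4), CS(1, 1), CS(10, 0)      # values 7, 2, 10
w1 = fcu(5, x, y, k)                         # 5 * (7 + 2) + 10
w2 = fcu(5, x, y, k, mode=2)                 # 5 * 10 + (7 + 2)
w3 = fcu(5, x, y, k, cl0=1, cl3=1)           # 5 * (7 - 2) - 10
```

In hardware, the subtractions are realized with the 1’s complement plus the 1-bit ace described above; the sketch uses ordinary integer subtraction because it models values, not bits.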
Figure 3: FCU template library.
The multiplier comprises a CS-to-MB module, which adopts a recently proposed technique to
recode the 17-bit P∗ into its respective modified Booth (MB) digits with minimal carry
propagation. The multiplier’s product consists of 17 bits. The multiplier includes a compensation
method for reducing the error that the truncation technique imposes on the product’s accuracy.
However, since all the FCU inputs consist of 16 bits and provided that there are no overflows, the
16 most significant bits of the 17-bit W∗ (i.e., the output of the Carry-Save Adder (CSA) tree, and
thus, of the FCU) are inserted in the appropriate FCU when requested.
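For readers unfamiliar with MB digits, the sketch below shows conventional radix-4 modified Booth recoding of an operand that is already in two’s complement binary. It does not reproduce the CS-to-MB technique mentioned above, which recodes directly from CS form; the function names and the choice of an 18-bit default width (a 17-bit operand sign-extended to an even width) are assumptions for illustration.

```python
def mb_digits(value, width=18):
    """Radix-4 modified Booth digits, least significant first.

    Digit j is -2*b[2j+1] + b[2j] + b[2j-1] with b[-1] = 0, so every digit
    lies in {-2, -1, 0, 1, 2} and sum(d_j * 4**j) equals the operand.
    `width` must be even; the default assumes a 17-bit operand
    sign-extended by one bit.
    """
    assert width % 2 == 0
    bits = value & ((1 << width) - 1)   # two's complement bit pattern
    digits = []
    prev = 0                            # b[-1]
    for j in range(width // 2):
        b0 = (bits >> (2 * j)) & 1
        b1 = (bits >> (2 * j + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def mb_multiply(a, p, width=18):
    """Multiply a by p as a sum of one partial product per MB digit."""
    return sum(d * a * 4 ** j for j, d in enumerate(mb_digits(p, width)))


small = mb_digits(7, width=4)           # 7 = -1 + 2 * 4
product = mb_multiply(123, -4567)
```

The recoding halves the number of partial products relative to a bit-by-bit multiplier, since each digit covers two bits of the operand.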
Advantages:
High degree of computational density
Reduced area
Reduced power consumption
Software implementation:
ModelSim
Xilinx ISE