CSBP: A Fast Circuit Similarity-Based Placement for FPGA Incremental Design and Design Space Exploration


Xiaoyu Shi (1), Dahua Zeng (1), Yu Hu (2), Guohui Lin (1), Osmar R. Zaiane (1)

(1) Dept. of Computing Science, University of Alberta
(2) Dept. of Electrical and Computer Engineering, University of Alberta

Presented by Xiaoyu Shi

Please address comments to bryanhu@ece.ualberta.ca

Outline

Introduction

Circuit Similarity-Based Placement

Experimental Results

Conclusion and Future Work

Introduction
- Field Programmable Gate Array (FPGA)
  - Ease of design, low start-up costs and fast manufacturing turnaround time.
  - FPGA sizes have reached the million-gate level.
  - Modern FPGA designs suffer from long compilation times.

Figure: Xilinx SPARTAN-6 board

- FPGA placement
  - Determines which logic block within an FPGA should implement each of the logic blocks required by the circuit.
  - Has a significant impact on performance and routability in nanometer circuit designs.
  - The optimization goal is to minimize criteria such as wire length, critical delay and area.
  - Has become the bottleneck of modern FPGA circuit design [Chen’06].
- Up-to-date fast placement algorithms
  - Improving placement efficiency as a single synthesis phase has been studied extensively for decades.
  - State-of-the-art work includes multi-core [Ludwin’08], embedding-based [Gopalakrishnan’06], partitioning-based [Maidee’05], multi-level [Sankar’99] and simulated annealing [Betz’97] approaches.
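As a concrete reading of the wire length criterion, the sketch below computes a half-perimeter bounding box cost for a placement. The function name, the net representation and the toy coordinates are illustrative assumptions, not taken from the slides; production placers such as VPR typically add per-net correction factors on top of this basic sum.

```python
from typing import Dict, List, Tuple

Placement = Dict[str, Tuple[int, int]]  # block name -> (x, y) grid location


def bounding_box_cost(nets: List[List[str]], placement: Placement) -> int:
    """Sum of half-perimeter bounding boxes over all nets (hypothetical helper)."""
    total = 0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Toy example (assumed data): three blocks on a grid, two nets.
placement = {"a": (0, 0), "b": (3, 1), "c": (1, 4)}
nets = [["a", "b"], ["a", "b", "c"]]
print(bounding_box_cost(nets, placement))  # prints 11
```

The first net spans 3 columns and 1 row (cost 4) and the second spans 3 columns and 4 rows (cost 7), which gives the total of 11.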

Reusable Info in CAD
- Incremental design for FPGAs
  - Design preservation is the key to incremental design.
  - Similarity among circuits exists because functional changes or optimizations are small, and they generally result in a modified circuit whose topology is similar to that of the original circuit [Krishnaswamy’09].

Figure: Incremental design process for FPGAs. Iteration 1 (initial design), Iteration 2 (changes due to verification, timing, etc.), Iteration 3 (optimizations, timing, etc.), ..., final iteration (final design).

Reusable Info in CAD (Cont.)
- Design space exploration for FPGAs
  - FPGA design offers a variety of customizations by varying design parameters.
  - Both local similarity and global similarity exist in design space exploration.

Figure: Constant multiplier blocks by CMU SPIRAL [Puschel’04]

Data Mining
- Overview
  - The key of data mining is to extract patterns and useful information from data, including text, graphs and circuits.
  - It has been studied extensively since the 1950s and has been widely applied in many domains, such as business, science and health care.
  - Graph mining, including graph pattern mining, graph classification and graph compression, is a hot research area in data mining [Borgwardt’08].
- Graph similarity
  - It quantitatively defines the topological similarity between two graphs.
  - It has been used in many applications, such as web searching [Kleinberg’99], social network mapping [Watts’99] and chemical structure matching [Hattori’03].

Graph Similarity
- Summary of graph similarity measures

Measure | Description | Time complexity | Global topology
Isomorphism [Pelillo’02] | Identifying a bijection between the nodes of two graphs which preserves (directed) adjacency | NP-Hard | Yes
Edit distance [Bunke’99] | Given a cost function on edit operations, determine the minimum cost transformation from one graph to another | NP-Hard | Yes
Common subgraph [Fernandez’01] | Identifying the largest isomorphic subgraphs of two graphs | NP-Hard | Yes
Iterative methods [Blondel’04] | Two graph elements are similar if their neighborhoods are similar | Cubic | Yes
Statistical methods [Alberta’02] | Assessing aggregate measures of graph structure: degree distribution, diameter, betweenness measures | Linear | No

- Iterative methods
  - They have lower computational complexity and consider global topological information.
  - They take advantage of graph sparsity.

Circuit Similarity
- Circuit similarity
  - We define circuit similarity to describe the similar topological structures between two circuits.
  - We adapt the iterative methods from graph similarity.
  - It exists in several CAD phases, such as placement, routing and verification.
  - It can be widely used to accelerate FPGA design tasks, such as incremental design and design space exploration.

Motivating Example
- Circuit similarity algorithm

Figure: Graph G and Graph G’

Similarity score matrix for G and G’ (rows: nodes of G’; columns: nodes of G)

       V7    V8    V9    V10   V11   V12   V13   V14   V15   V16
V’7    0.92  0.25  0.48  0.15  0     0     0     0.42  0.06  0
V’8    0     0.73  0     0     0.05  0     0.39  0     0.17  0.06
V’9    0     0.39  0     0     0.4   0     0.73  0     0.06  0.48
V’10   0.48  0     0.89  0.25  0.3   0.12  0.14  0.06  0.33  0.09
V’11   0     0     0.11  0.48  0     0.86  0     0.36  0.17  0
V’12   0     0     0.3   0.34  0.64  0.25  0.39  0.34  0.15  0.42
V’13   0.48  0.25  0.07  0.4   0     0.36  0     0.88  0.06  0
V’14   0.4   0.39  0.29  0.15  0.15  0.18  0.12  0.46  0.59  0.06
V’15   0     0.12  0.09  0     0.63  0     0.36  0     0.27  0.82

Motivating Example (Cont.)
- Circuit similarity-based placement
  - The initial placement of the new circuit design (G’) is generated by computing the similarity between the original (G) and modified circuits, and finding the corresponding node matching.
  - A low-temperature simulated annealing is applied to further refine the results.
  - The proposed circuit similarity algorithm can be used to speed up placement, which allows faster incremental design and design space exploration.
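The matching step can be sketched as follows. The scores are the ones shown in the motivating example, read here with rows V'7 to V'15 (new circuit G') and columns V7 to V16 (reference circuit G); that reading of the extracted matrix is an assumption. The greedy pairing is a deliberately simple stand-in for a proper weighted matching solver, and the reference placement coordinates and function names are hypothetical.

```python
from typing import Dict, List, Tuple

ROWS = [f"V'{i}" for i in range(7, 16)]  # nodes of the new circuit G'
COLS = [f"V{i}" for i in range(7, 17)]   # nodes of the reference circuit G
SCORES = [
    [0.92, 0.25, 0.48, 0.15, 0, 0, 0, 0.42, 0.06, 0],
    [0, 0.73, 0, 0, 0.05, 0, 0.39, 0, 0.17, 0.06],
    [0, 0.39, 0, 0, 0.4, 0, 0.73, 0, 0.06, 0.48],
    [0.48, 0, 0.89, 0.25, 0.3, 0.12, 0.14, 0.06, 0.33, 0.09],
    [0, 0, 0.11, 0.48, 0, 0.86, 0, 0.36, 0.17, 0],
    [0, 0, 0.3, 0.34, 0.64, 0.25, 0.39, 0.34, 0.15, 0.42],
    [0.48, 0.25, 0.07, 0.4, 0, 0.36, 0, 0.88, 0.06, 0],
    [0.4, 0.39, 0.29, 0.15, 0.15, 0.18, 0.12, 0.46, 0.59, 0.06],
    [0, 0.12, 0.09, 0, 0.63, 0, 0.36, 0, 0.27, 0.82],
]


def greedy_matching(scores: List[List[float]], rows: List[str],
                    cols: List[str]) -> Dict[str, str]:
    """Pair nodes by descending score (simplified stand-in for a matching solver)."""
    candidates = sorted(
        ((s, r, c) for r, row in enumerate(scores)
         for c, s in enumerate(row) if s > 0),
        key=lambda item: -item[0],
    )
    used_rows, used_cols, match = set(), set(), {}
    for _, r, c in candidates:
        if r not in used_rows and c not in used_cols:
            used_rows.add(r)
            used_cols.add(c)
            match[rows[r]] = cols[c]
    return match


def seed_placement(match: Dict[str, str],
                   reference: Dict[str, Tuple[int, int]]) -> Dict[str, Tuple[int, int]]:
    """Each matched node of G' inherits the location of its counterpart in G."""
    return {new: reference[old] for new, old in match.items()}


# Hypothetical reference placement: V7..V16 laid out row by row on a 4-wide grid.
reference = {name: (i % 4, i // 4) for i, name in enumerate(COLS)}
match = greedy_matching(SCORES, ROWS, COLS)
initial = seed_placement(match, reference)
print(match["V'9"], initial["V'9"])  # prints: V13 (2, 1)
```

With these scores every node of G' finds a distinct partner and V10 is left unmatched, which fits a new circuit that is one node smaller than the reference.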

Motivating Example (Cont.)

Figure: Placement layouts comparison of circuit “des”: (a) placement of reference config, (b) initial placement using CS, (c) final placement using CS, (d) initial placement using VPR, (e) final placement using VPR.

- A real example
  - For circuit “des”, the reference configuration (synthesized using the “resyn3” script in ABC) has 1245 CLBs and 1501 nets, while the new configuration (synthesized using the “rwsat2” script in ABC) has 1215 CLBs and 1471 nets.
  - The results show that CSBP successfully finds the internal node correspondence.

            Wire   Delay (E-05)   Critical Delay (E-08)   Runtime (s)
CS-init     306    5.93           -                       -
VPR-init    1087   14.00          -                       -
CS-final    237    5.08           8.28                    13.38
VPR-final   221    4.98           10.10                   28.42

Status of placement results of circuit “des”
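The relative differences implied by the “des” numbers can be worked out directly. The short calculation below uses only the values in the table; the assignment of values to the columns Wire, Delay (E-05), Critical Delay (E-08) and Runtime (s) follows the extracted header order and is an assumption.

```python
def percent_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


# Initial placements: CS-init versus VPR-init.
init_wire = percent_change(306, 1087)
init_delay = percent_change(5.93, 14.00)

# Final placements: CS-final versus VPR-final.
final_wire = percent_change(237, 221)
final_crit_delay = percent_change(8.28, 10.10)
runtime_speedup = 28.42 / 13.38

print(f"initial wire:         {init_wire:+.1f}%")         # -71.8%
print(f"initial delay:        {init_delay:+.1f}%")        # -57.6%
print(f"final wire:           {final_wire:+.1f}%")        # +7.2%
print(f"final critical delay: {final_crit_delay:+.1f}%")  # -18.0%
print(f"runtime speedup:      {runtime_speedup:.2f}x")    # 2.12x
```

For this circuit the CS initial placement is already far closer to the final VPR result than the VPR initial placement is, and the refined CS result trades about 7% more wire for a shorter critical delay in less than half the runtime.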

Circuit Similarity CAD Flow

Figures: CAD flow for incremental design; CAD flow for design space exploration

Circuit Similarity Algorithm
- Iterative similarity algorithm
  - We employ the iterative similarity algorithm for undirected molecular graphs [Rupp’07].
  - We adapt the iterative similarity algorithm to handle directed circuit graphs, fix the I/O pins, and compute the similarity of fanin and fanout nodes separately, based on unique circuit constraints.
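To make the adaptation concrete, here is a minimal sketch of a neighborhood-based iterative similarity for directed graphs with pinned I/O pins. It uses a generic neighbor-sum update in the spirit of the iterative methods cited earlier [Blondel’04]; it is not the [Rupp’07] update or the authors' exact formulation. The function names, the max-normalization and the toy circuits are assumptions, and I/O pins are assumed to carry the same names in both circuits.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[str]]  # node -> list of fanout nodes (every node is a key)


def fanins(graph: Graph) -> Dict[str, List[str]]:
    result: Dict[str, List[str]] = {v: [] for v in graph}
    for v, outs in graph.items():
        for w in outs:
            result[w].append(v)
    return result


def circuit_similarity(g: Graph, g2: Graph, io_pins: Set[str],
                       iterations: int = 10) -> Dict[Tuple[str, str], float]:
    in1, in2 = fanins(g), fanins(g2)
    # Pinned scores: an I/O pin matches only the pin with the same name.
    fixed = {(v, w): float(v == w)
             for v in g for w in g2 if v in io_pins or w in io_pins}
    scores = {(v, w): fixed.get((v, w), 1.0) for v in g for w in g2}
    for _ in range(iterations):
        new = dict(fixed)
        for v in g:
            for w in g2:
                if (v, w) in fixed:
                    continue
                # Fanin and fanout neighborhoods contribute separately.
                fanin_part = sum(scores[a, b] for a in in1[v] for b in in2[w])
                fanout_part = sum(scores[a, b] for a in g[v] for b in g2[w])
                new[v, w] = fanin_part + fanout_part
        peak = max((s for pair, s in new.items() if pair not in fixed), default=0.0)
        if peak > 0:
            for pair in new:
                if pair not in fixed:
                    new[pair] /= peak  # keep internal scores in [0, 1]
        scores = new
    return scores


# Toy circuits (assumed): identical structure, different internal names.
g = {"i1": ["a"], "i2": ["b"], "a": ["c"], "b": ["c"], "c": ["o"], "o": []}
g2 = {"i1": ["x"], "i2": ["y"], "x": ["z"], "y": ["z"], "z": ["o"], "o": []}
scores = circuit_similarity(g, g2, io_pins={"i1", "i2", "o"})
best = {v: max(("x", "y", "z"), key=lambda w: scores[v, w]) for v in ("a", "b", "c")}
print(best)  # {'a': 'x', 'b': 'y', 'c': 'z'}
```

Pinning the I/O pins is what breaks the symmetry here: without it, nodes a and b would be indistinguishable by topology alone.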

If (|in(vi)| < |in(v’j)| and |out(vi)| < |out(v’j)|)

Summary of variables

Performance Enhancement
- Support constraint
  - A support of a node is the set of nodes with predefined matchings in the transitive fanin or fanout cone of this node.
  - Formally, if v ∈ G and v’ ∈ G’, the support constraint requires: [equation missing in source], where β ∈ (0,1].
- Level constraint
  - A topological sort and reverse topological sort can label each internal node with two values.
  - Formally, if v ∈ G and v’ ∈ G’, the level constraint requires: [equation missing in source], where Bl and Br are two nonnegative integers.

Figure: Effectiveness of the pruning techniques
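The level constraint can be illustrated with a small sketch. Since the exact inequality is not recoverable from the slide, the form used here is an assumption: a pair (v, v’) is kept only if their forward levels differ by at most Bl and their reverse levels differ by at most Br. The helper names and toy circuits are also hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]  # node -> fanout nodes of an acyclic circuit


def levels(graph: Graph) -> Dict[str, Tuple[int, int]]:
    """Label each node with (level from inputs, level to outputs)."""
    fanin: Dict[str, List[str]] = {v: [] for v in graph}
    for v, outs in graph.items():
        for w in outs:
            fanin[w].append(v)
    order = list(TopologicalSorter(fanin).static_order())  # inputs first
    forward: Dict[str, int] = {}
    for v in order:
        forward[v] = 1 + max((forward[u] for u in fanin[v]), default=-1)
    reverse: Dict[str, int] = {}
    for v in reversed(order):  # outputs first
        reverse[v] = 1 + max((reverse[w] for w in graph[v]), default=-1)
    return {v: (forward[v], reverse[v]) for v in graph}


def level_compatible(lv: Tuple[int, int], lw: Tuple[int, int],
                     b_l: int, b_r: int) -> bool:
    # Assumed form of the level constraint (not taken from the slide).
    return abs(lv[0] - lw[0]) <= b_l and abs(lv[1] - lw[1]) <= b_r


# Toy circuits (assumed): g2 has one extra node t before the output.
g = {"i1": ["a"], "i2": ["b"], "a": ["c"], "b": ["c"], "c": ["o"], "o": []}
g2 = {"i1": ["x"], "i2": ["y"], "x": ["z"], "y": ["z"], "z": ["t"], "t": ["o"], "o": []}
l1, l2 = levels(g), levels(g2)
candidates = [(v, w) for v in ("a", "b", "c") for w in ("x", "y", "z", "t")
              if level_compatible(l1[v], l2[w], b_l=1, b_r=1)]
print(len(candidates), "of 12 internal pairs survive")  # 8 of 12 internal pairs survive
```

Pairs that fail the test never need a similarity score, which is where the pruning saves time; larger bounds keep more pairs at the cost of more computation.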

Incremental Design
- CAD flow
  - Two-iteration CAD flow.
  - CSBP flow (a) and from-scratch flow (b) are compared.
  - Optimization “imfs” reduces the number of CLBs by 2%.
- Settings
  - Two versions of CSBP are compared: a high-quality version (CS) with β = 0.5, inner_num = 1 and Bl = Br = 1; and a turbo version (CS-t) with β = 1, inner_num = 0.1 and Bl = Br = 0.
  - CSBP is implemented in C and evaluated on the 20 largest MCNC benchmarks.
  - The results are averaged over 5 runs on a Linux server with a dual-core 2.19GHz CPU and 5GB memory.
  - The CS2 package [Goldberg’97] is used to solve the maximum matching problem.

Figure: CAD flow for incremental design
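The role of inner_num in the refinement step can be sketched with a small low-temperature annealer. This is an illustrative simplification, not the CSBP or VPR implementation: it only swaps placed blocks, uses a plain bounding box cost, and assumes the VPR-style convention of inner_num × N^(4/3) moves per temperature. The starting temperature, cooling rate and toy data are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple

Placement = Dict[str, Tuple[int, int]]


def bb_cost(nets: List[List[str]], placement: Placement) -> int:
    """Half-perimeter bounding box cost summed over all nets."""
    return sum(
        max(placement[b][0] for b in net) - min(placement[b][0] for b in net)
        + max(placement[b][1] for b in net) - min(placement[b][1] for b in net)
        for net in nets
    )


def low_temperature_anneal(placement: Placement, nets: List[List[str]],
                           temperature: float = 0.5, inner_num: float = 1.0,
                           cooling: float = 0.8, min_temperature: float = 0.01,
                           seed: int = 0) -> Tuple[Placement, int]:
    rng = random.Random(seed)
    current = dict(placement)
    blocks = sorted(current)
    cost = bb_cost(nets, current)
    best, best_cost = dict(current), cost
    # Assumed convention: moves per temperature scale with inner_num * N^(4/3).
    moves = max(1, int(inner_num * len(blocks) ** (4 / 3)))
    while temperature > min_temperature:
        for _ in range(moves):
            a, b = rng.sample(blocks, 2)
            current[a], current[b] = current[b], current[a]
            new_cost = bb_cost(nets, current)
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = dict(current), cost
            else:
                current[a], current[b] = current[b], current[a]  # undo the swap
        temperature *= cooling
    return best, best_cost


# Toy data (assumed): a chain of six blocks scattered over a 3x2 grid.
nets = [["b0", "b1"], ["b1", "b2"], ["b2", "b3"], ["b3", "b4"], ["b4", "b5"]]
start = {"b0": (0, 0), "b1": (2, 1), "b2": (0, 1),
         "b3": (2, 0), "b4": (1, 1), "b5": (1, 0)}
initial_cost = bb_cost(nets, start)
refined, refined_cost = low_temperature_anneal(start, nets, inner_num=1.0)
print(initial_cost, refined_cost)
```

Starting from a low temperature keeps most of the structure of the seed placement, and lowering inner_num (as in the turbo setting) cuts the number of moves per temperature proportionally.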

Results
- Initial placement results
  - Bounding box cost (bb cost) and delay cost are compared.
  - The initial placement generated using CS is much better than VPR’s initial placement and very close to VPR’s final placement.

Figure: Comparisons of initial bb cost (CS-init, VPR-final, VPR-init; y-axis: percentage; x-axis: the 20 benchmarks s38417, s38584, s298, pdc, alu4, ex1010, apex2, apex4, tseng, ex5p, frisc, seq, des, diffeq, misex3, spla, bigkey, clma, dsip, elliptic). CS reduces bb cost by 72% on avg. compared to VPR.

Figure: Comparisons of initial delay cost (CS-init, VPR-final, VPR-init; y-axis: percentage; same benchmarks). CS reduces delay cost by 53% on avg. compared to VPR.

Results (Cont.)
- Post-routing results comparison
  - A low-temperature annealing is applied to the initial results.
  - Wire length, critical delay and area are compared.
  - The results demonstrate the effectiveness of the pruning techniques, which do not affect the quality significantly.

Figure: Wire length per benchmark (CS-t, CS, VPR). CS increases the wire length by 3% on avg.

Figure: Area per benchmark (CS-t, CS, VPR). CS increases the area by 2% on avg.

Figure: Critical delay per benchmark (CS-t, CS, VPR). CS increases the crit. delay by 6% on avg.

Results (Cont.)
- Runtime comparison
  - Only placement time is compared.
  - CS-t achieves a 31x speedup on average, and up to 91x.
  - Larger speedups are expected as circuits become larger.

Figure: Speedups compared to VPR (CS-t, CS, VPR)

Design Space Exploration
- CAD flow
  - Logic-level and algorithm-level design spaces are studied separately.
  - CSBP flow (a) and from-scratch flow (b) are compared.
- Settings
  - The logic-level design space consists of 19 configurations generated by 19 ABC (1) synthesis scripts in abc.rc.
  - The algorithm-level design space consists of 18 configurations of a constant multiplier generated by CMU SPIRAL [Puschel’04], with bits varying from 7 to 25 (2).
  - Both CS and CS-t are evaluated.
  - The benchmarking environment is the same as for logic-level design space exploration.

Figure: CAD flow for design space exploration

(1) http://www.eecs.berkeley.edu/~alanmi/abc/
(2) Bit = 16 is abandoned due to ABC crash

Logic-level Sample Synthesis Scripts
Alias Scripts
resyn "b; rw; rwz; b; rwz; b"

resyn2 "b; rw; rf; b; rw; rwz; b; rfz; rwz; b"
resyn2a "b; rw; b; rw; rwz; b; rwz; b"

src_rw "st; rw -l; rwz -l; rwz -l"

src_rs "st; rs -K 6 -N 2 -l; rs -K 9 -N 2 -l; rs -K 12 -N 2 -l"

choice "fraig_store; resyn; fraig_store; resyn2; fraig_store; fraig_restore"
rwsat "st; rw -l; b -l; rw -l; rf -l"

compress "b -l; rw -l; rwz -l; b -l; rwz -l; b -l"
share "st; multi -m; fx; resyn2"

http://www.eecs.berkeley.edu/~alanmi/abc/

Logic Level Results
- Initial results comparison
  - The number of CLBs and levels vary widely in the logic-level design space.
  - Circuit “dsip” is shown as an example.
  - Bounding box cost and delay cost are compared for initial placement results.

Figure: Characteristics of logic-level design space

Figure: Initial bb cost of “dsip” (CS, CS-t, VPR; x-axis: the 19 synthesis scripts shake, rwsat2, share, resyn2rsdc, resyn2a, choice, compress2rsdc, resyn2, resyn3, choice2, rwsat, src_rs, compress2, src_rw, src_rws, resyn2rs, resyn, compress, compress2rs). CS reduces bb cost by 76% on avg.

Figure: Initial delay cost of “dsip” (CS, CS-t, VPR; y-axis: critical delay; same scripts). CS reduces delay cost by 48% on avg.

Logic Level Results (Cont.)
- Final placement results
  - Wire length and critical delay of circuit “dsip” are compared.
  - The final results produced by CS and CS-t are very close to or better than VPR’s, with 32% overhead for wire length and 20% improvement for critical delay.

Figure: Final wire length comparison of “dsip” (CS-t, CS, VPR; y-axis: percentage; x-axis: synthesis scripts)

Figure: Final critical delay comparison of “dsip” (CS-t, CS, VPR; y-axis: percentage; x-axis: synthesis scripts)

- Design space shape characterization
  - We compare the minimal, median and maximal wire length and critical delay produced by CS, CS-t and VPR.
  - We also compare the shapes of each configuration over 19 designs.
  - The almost identical curves show that CSBP is able to accurately depict the shape of a design space.

Figure: Shape of final wire length of circuit “dsip” (vpr, cs, cs-t; x-axis: synthesis scripts)

Figure: Shape of minimal wire length of 20 circuits over 19 designs (vpr-min, cs-min, cs-t-min)

Figure: Shape of minimal crit. delay of 20 circuits over 19 designs (vpr-min, cs-min, cs-t-min)

- Runtime comparison
  - Only placement time is compared.
  - CS-t achieves a 30x speedup on average, and up to 100x.
  - In practice, one can take advantage of the significant speedup of CS-t to perform quick design space exploration.

Figure: Runtime comparison. Speedups compared to VPR for the 20 benchmark circuits (CS, CS-t, VPR; “*” marked time is measured with a timeout).

Algorithm Level Results
- Experimental settings
  - The algorithm-level design is a constant multiplier.
  - The design parameter explored in our experiments is the number of fractional bits, varying from 7 to 25 (1).
  - CMU SPIRAL is used to generate the RTL design based on the Hcub algorithm [Voronenko’07].
- Experimental results
  - The initial and final placement results are similar to those of logic-level space exploration.
  - CS and CS-t achieve 7x and 30x speedup compared to VPR, respectively.

Figure: Characteristics of algorithm-level design space generated by CMU SPIRAL

Figure: An example of a constant parallel multiplier

(1) Bit = 16 is abandoned due to ABC crash

Algorithm Level Results (Cont.)
- Wire length-delay space comparison
  - The Pareto points, which are the optimal configurations in a design space, are of most interest to IC designers.
  - CS and VPR find the same Pareto points.
  - Bits = 24 is used as the reference circuit.

Figure: Wire length-delay space of VPR (x-axis: wire length; y-axis: estimated critical delay; points labeled by bit width, B7 to B25)

Figure: Wire length-delay space of CS (x-axis: wire length; y-axis: estimated critical delay; points labeled by bit width, B7 to B25)
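For readers less familiar with the term, the Pareto points of a wire length-delay space are the configurations that no other configuration beats on both axes. A minimal sketch follows; the function name, configuration names and numbers are made up for illustration and are not read from the plots.

```python
from typing import Dict, List, Tuple


def pareto_points(designs: Dict[str, Tuple[float, float]]) -> List[str]:
    """Designs not dominated in (wire length, critical delay); lower is better."""
    front = []
    for name, (wire, delay) in designs.items():
        dominated = any(
            w <= wire and d <= delay and (w < wire or d < delay)
            for other, (w, d) in designs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical configurations: (wire length, estimated critical delay in seconds).
designs = {
    "cfg_a": (120.0, 1.6e-7),
    "cfg_b": (150.0, 1.5e-7),
    "cfg_c": (160.0, 1.7e-7),
    "cfg_d": (110.0, 1.9e-7),
}
print(pareto_points(designs))  # ['cfg_a', 'cfg_b', 'cfg_d']
```

Here cfg_c is dropped because cfg_a has both shorter wire length and lower delay. Running the same filter on the results of two placers and comparing the returned sets is one way to check that they identify the same optimal configurations.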

Future Work
- Improvement to CSBP
  - Integrate predefined matchings, for example name matching, into CSBP to further enhance both the efficiency and the quality of the design.
- Other applications
  - Study the effectiveness of applying the circuit similarity algorithm to other applications, such as routing and sequential verification for FPGAs.

Conclusion
- Proposed an efficient circuit similarity algorithm.
- Developed CSBP, a fast circuit similarity-based placement for FPGAs.
- Applied CSBP to incremental design and design space exploration.
- Open-source tool available at: http://webdocs.cs.ualberta.ca/~xshi/soft.html
- Applied CSBP to incremental design for FPGAs
  - CSBP is able to reduce engineering effort by capturing the similarity from the previous design iterations.
  - CSBP is 31x faster than VPR.
- Applied CSBP to design space exploration for FPGAs
  - CSBP can precisely depict the shape of a design space and pinpoint the optimal designs.
  - CSBP is 30x faster than VPR.
