CSBP: A Fast Circuit Similarity-Based Placement for FPGA Incremental Design and Design Space Exploration
1. CSBP: A Fast Circuit Similarity-Based
Placement for FPGA Incremental Design
and Design Space Exploration
Xiaoyu Shi¹, Dahua Zeng¹, Yu Hu², Guohui Lin¹, Osmar R. Zaiane¹
¹ Dept. of Computing Science, University of Alberta
² Dept. of Electrical and Computer Engineering, University of Alberta
Presented by Xiaoyu Shi
Please address comments to bryanhu@ece.ualberta.ca
2. Outline
Introduction
Circuit Similarity-Based Placement
Experimental Results
Conclusion and Future Work
3. Introduction
Field Programmable Gate Array (FPGA)
Ease of design, low start-up costs and fast manufacturing
turnaround time.
FPGA capacity has reached the million-gate level.
Modern FPGA designs suffer from long compilation time.
Xilinx SPARTAN-6 board
FPGA placement
Determines which logic block within an FPGA should implement each of the
logic blocks required by the circuits.
Has a significant impact on the performance and routability in nanometer
circuit designs.
The optimization goals are to minimize certain criteria, such as wire length,
critical delay and area.
Has become the bottleneck of modern FPGA circuit design [Chen’06].
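The wire length goal above is commonly estimated with a bounding-box cost: for each net, the half-perimeter of the smallest rectangle enclosing its blocks. The following is a minimal sketch of that estimate, not the cost function used in the talk. The names `bounding_box_cost`, `nets` and `placement` are illustrative assumptions, and the fanout correction factor that VPR applies to each net is left out.

```python
def bounding_box_cost(nets, placement):
    """Sum of half-perimeter bounding boxes over all nets.

    nets:      list of nets, each a list of block names (hypothetical format)
    placement: dict mapping block name -> (x, y) grid location
    """
    total = 0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        # Half-perimeter of the rectangle that encloses every block on the net.
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Toy data, made up for illustration only.
placement = {"a": (0, 0), "b": (3, 1), "c": (1, 4)}
nets = [["a", "b"], ["a", "b", "c"]]
cost = bounding_box_cost(nets, placement)
print(cost)
```

A placer tries to lower this number, usually together with a timing term, by moving blocks between sites.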
Up-to-date fast placement algorithms
Extensive studies have been performed to improve the placement efficiency
as a single synthesis phase for decades.
State-of-the-art work includes using multi-core [Ludwin’08], embedding-
based [Gopalakrishnan’06], partitioning-based [Maidee’05], multi-level
[Sankar’99], simulated annealing [Betz’97].
4. Reusable Info in CAD
Incremental design for FPGAs
Design preservation is the key to incremental design.
Similarity among circuits exists because functional changes or optimizations
are small, and they generally result in a similar topology of the modified
circuit compared to the original circuit [Krishnaswamy’09].
Figure: Incremental design process for FPGAs. Initial design (Iteration 1), changes due to verification, timing, etc. (Iteration 2), optimizations, timing, etc. (Iteration 3 …), final design (final iteration).
5. Reusable Info in CAD (Cont.)
Design space exploration for FPGAs
FPGA design offers a variety of customizations by varying design
parameters.
Local similarity and global similarity exist in design space exploration.
Figure: Constant multiplier blocks by CMU SPIRAL [Puschel’04]
6. Data Mining
Overview
The key to data mining is extracting patterns and useful information from
data, including text, graphs and circuits.
It has been studied extensively since the 1950s and has been widely applied in
many domains, such as business, science and health care.
Graph mining, including graph pattern mining, graph classification and graph
compression, is a hot research area in data mining [Borgwardt’08].
Graph similarity
It quantitatively defines the topological similarity between two graphs.
It has been used in many applications, such as web searching
[Kleinberg’99], social network mapping [Watts’99] and chemical structure
matching [Hattori’03].
7. Graph Similarity
Summary of graph similarity measures
Measure | Description | Time complexity | Global topology
Isomorphism [Pelillo’02] | Identifying a bijection between the nodes of two graphs which preserves (directed) adjacency | NP-Hard | Yes
Edit distance [Bunke’99] | Given a cost function on edit operations, determine the minimum cost transformation from one graph to another | NP-Hard | Yes
Common subgraph [Fernandez’01] | Identifying the largest isomorphic subgraphs of two graphs | NP-Hard | Yes
Iterative methods [Blondel’04] | Two graph elements are similar if their neighborhoods are similar | Cubic | Yes
Statistical methods [Alberta’02] | Assessing aggregate measures of graph structure, degree distribution, diameter, betweenness measures | Linear | No
Iterative methods
They have lower computational complexity while still considering global topological
information.
They take advantage of graph sparsity.
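The idea behind the iterative methods ("two graph elements are similar if their neighborhoods are similar") can be sketched with the node-similarity update described in [Blondel’04]: each pair of nodes collects the scores of its predecessor pairs and successor pairs, and the score matrix is renormalized after every pass. This is a minimal sketch of that general scheme, not the circuit-specific variant used by CSBP. The function name, the edge-list format and the iteration count are assumptions. Looping over edge pairs instead of dense matrices is one way to exploit sparsity.

```python
import math


def similarity_matrix(edges_a, n_a, edges_b, n_b, iterations=20):
    """Iterative node similarity between directed graphs A and B.

    edges_a, edges_b: lists of (source, target) node indices
    Returns s with s[i][j] = similarity of node i in B and node j in A.
    An even iteration count is used because the even iterates converge.
    """
    s = [[1.0] * n_a for _ in range(n_b)]
    for _ in range(iterations):
        new = [[0.0] * n_a for _ in range(n_b)]
        for p, q in edges_b:
            for u, v in edges_a:
                new[q][v] += s[p][u]  # targets inherit from their predecessors
                new[p][u] += s[q][v]  # sources inherit from their successors
        norm = math.sqrt(sum(x * x for row in new for x in row))
        if norm == 0.0:
            return new  # no edges in one of the graphs: nothing to compare
        s = [[x / norm for x in row] for row in new]
    return s


# Two identical three-node chains 0 -> 1 -> 2 (toy example).
chain = [(0, 1), (1, 2)]
s = similarity_matrix(chain, 3, chain, 3)
best = [max(range(3), key=lambda j: s[i][j]) for i in range(3)]
print(best)
```

For the two identical chains, every node scores highest against its own counterpart, which is the kind of correspondence a similarity-based placer needs.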
8. Circuit Similarity
Circuit similarity
We define circuit similarity to describe the similar topological structures
between two circuits.
We adapt the iterative methods in graph similarity.
It exists in several CAD phases, such as placement, routing and verification.
It can be used to accelerate FPGA design tasks, such as incremental
design and design space exploration.
11. Motivating Example (Cont.)
Circuit similarity-based
placement
The initial placement of the new
circuit design (G’) is generated by
computing the similarity between
the original (G) and modified
circuits, and finding the
corresponding node matching.
A low-temperature simulated
annealing is applied to further
refine the results.
The proposed circuit similarity
algorithm can be used to speed up
placement, which allows faster
incremental design and design
space exploration.
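The first step above, turning similarity scores into an initial placement, can be sketched as follows. This is a rough illustration under stated assumptions: a greedy one-to-one matching is swapped in for the maximum matching used in the talk, and the names `seed_placement`, `similarity`, `ref_placement` and `free_sites` are hypothetical.

```python
def seed_placement(similarity, ref_placement, free_sites):
    """Place each new node at the site of its matched reference node.

    similarity:    dict {(new_node, ref_node): score}
    ref_placement: dict ref_node -> (x, y) from the reference design
    free_sites:    spare (x, y) sites for new nodes that find no match
    """
    placement = {}
    used_ref = set()
    # Greedy matching (an assumption): take pairs from best score downwards.
    for (new, ref), _score in sorted(similarity.items(), key=lambda kv: -kv[1]):
        if new in placement or ref in used_ref:
            continue
        placement[new] = ref_placement[ref]
        used_ref.add(ref)
    # Unmatched new nodes fall back to the spare sites, in order.
    spare = iter(free_sites)
    for new in sorted({n for n, _ in similarity}):
        if new not in placement:
            placement[new] = next(spare)
    return placement


# Toy scores, made up for illustration only.
similarity = {
    ("n1", "r1"): 0.9, ("n1", "r2"): 0.2,
    ("n2", "r1"): 0.8, ("n2", "r2"): 0.7,
    ("n3", "r1"): 0.1, ("n3", "r2"): 0.3,
}
ref_placement = {"r1": (1, 1), "r2": (2, 1)}
seed = seed_placement(similarity, ref_placement, free_sites=[(3, 1)])
print(seed)
```

The resulting placement is only a starting point. The annealing step described above is still needed to repair nodes that were matched poorly or not at all.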
12. Motivating Example (Cont.)
Figure: Placement layouts comparison of circuit “des”: (a) placement of reference config, (b) initial placement using CS, (c) final placement using CS, (d) initial placement using VPR, (e) final placement using VPR.
A real example
For circuit “des”, the reference configuration (synthesized using the “resyn3” script in ABC) has 1245 CLBs and 1501 nets, while the new configuration (synthesized using the “rwsat2” script in ABC) has 1215 CLBs and 1471 nets.
The results show that CSBP successfully finds the internal node correspondence.
Status of placement results of circuit “des”
Stage | Wire | Delay (E-05) | Critical delay (E-08) | Runtime (s)
CS-init | 306 | 5.93 | - | -
VPR-init | 1087 | 14.00 | - | -
CS-final | 237 | 5.08 | 8.28 | 13.38
VPR-final | 221 | 4.98 | 10.10 | 28.42
13. Circuit Similarity CAD Flow
CAD flow for incremental design CAD flow for design space exploration
14. Circuit Similarity Algorithm
Iterative similarity algorithm
We employ the iterative similarity
algorithm for undirected molecular
graphs [Rupp’07].
We adapt the iterative similarity
algorithm to consider directed
circuit graphs, fix the I/O pins, and
compute the similarity of fanin
and fanout nodes respectively,
based on unique circuit
constraints.
If (|in(vi)| < |in(v’j)| and |out(vi)| < |out(v’j)|)
Summary of variables
15. Performance Enhancement
Support constraint
A support of a node is the set of nodes with predefined matchings in the transitive fanin or fanout cone of this node.
Formally, if v ∈ G and v’ ∈ G’, the support constraint requires:
where β ∈ (0,1].
Level constraint
A topological sort and reverse topological sort can label each internal node with two values.
Formally, if v ∈ G and v’ ∈ G’, the level constraint requires:
where Bl and Br are two nonnegative integers.
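The level constraint can be illustrated with a short sketch. The inequality itself did not survive in this transcript, so the form used here is an assumption: a pair (v, v’) is kept only if its forward levels differ by at most Bl and its reverse levels differ by at most Br. The function names are hypothetical, and "level" is taken to mean the longest-path depth from the sources of the DAG.

```python
from collections import deque


def levels(num_nodes, edges):
    """Longest-path level of every node, computed in topological order."""
    succ = [[] for _ in range(num_nodes)]
    indeg = [0] * num_nodes
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    level = [0] * num_nodes
    queue = deque(n for n in range(num_nodes) if indeg[n] == 0)
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return level


def node_labels(num_nodes, edges):
    """Two values per node: level from the inputs and level from the outputs."""
    forward = levels(num_nodes, edges)
    reverse = levels(num_nodes, [(v, u) for u, v in edges])
    return list(zip(forward, reverse))


def passes_level_constraint(label_v, label_w, b_l, b_r):
    """Assumed form of the constraint; the slide's inequality is not available."""
    return (abs(label_v[0] - label_w[0]) <= b_l
            and abs(label_v[1] - label_w[1]) <= b_r)


# Toy circuits: a four-node chain (G) and a three-node chain (G').
labels_g = node_labels(4, [(0, 1), (1, 2), (2, 3)])
labels_g2 = node_labels(3, [(0, 1), (1, 2)])
loose = passes_level_constraint(labels_g[1], labels_g2[1], 1, 1)
strict = passes_level_constraint(labels_g[1], labels_g2[1], 0, 0)
print(loose, strict)
```

Under this assumed form, smaller bounds discard more candidate pairs before any similarity is computed, which trades matching quality for speed.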
Effectiveness of the pruning techniques
17. Incremental Design
CAD flow
Two-iteration CAD flow.
CSBP flow (a) and from-scratch
flow (b) are compared.
Optimization “imfs” reduces the
number of CLBs by 2%.
Settings
Two versions of CSBP are
compared: A high quality version
(CS) with β = 0.5, inner_num = 1
and Bl = Br = 1; A turbo version
(CS-t) with β = 1, inner_num = 0.1
and Bl = Br = 0.
CSBP is implemented in C and
evaluated on the 20 largest
MCNC benchmarks.
The results are averaged over 5
runs on a Linux server with a
dual-core 2.19GHz CPU and 5GB
memory.
The CS2 package [Goldberg’97] is
used to solve the maximum matching
problem.
Figure: CAD flow for incremental design
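The inner_num setting above controls how much work the annealer does at each temperature. In VPR the number of moves per temperature is inner_num · N^(4/3) for N blocks, so inner_num = 0.1 does a tenth of the work of inner_num = 1. The following is a minimal sketch of a low-temperature refinement pass under stated assumptions: the starting temperature, cooling rate, swap-only moves and the name `low_temperature_anneal` are illustrative and are not taken from the talk or from VPR.

```python
import math
import random


def hpwl(nets, placement):
    """Half-perimeter wire length summed over all nets."""
    total = 0
    for net in nets:
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def low_temperature_anneal(nets, placement, inner_num=1.0,
                           t_start=0.5, t_min=0.01, alpha=0.8, seed=0):
    """Refine a seeded placement; needs at least two blocks."""
    rng = random.Random(seed)
    current = dict(placement)
    blocks = sorted(current)
    cost = hpwl(nets, current)
    best, best_cost = dict(current), cost
    # VPR-style move budget per temperature.
    moves = max(1, int(inner_num * len(blocks) ** (4 / 3)))
    t = t_start
    while t > t_min:
        for _ in range(moves):
            a, b = rng.sample(blocks, 2)
            current[a], current[b] = current[b], current[a]
            delta = hpwl(nets, current) - cost
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost += delta
                if cost < best_cost:
                    best, best_cost = dict(current), cost
            else:
                # Rejected move: undo the swap.
                current[a], current[b] = current[b], current[a]
        t *= alpha
    return best, best_cost


# Toy seed placement on a single row, made up for illustration.
start = {"a": (0, 0), "b": (3, 0), "c": (1, 0), "d": (2, 0)}
nets = [["a", "b"], ["c", "d"], ["b", "c"]]
refined, refined_cost = low_temperature_anneal(nets, start, inner_num=1.0)
print(hpwl(nets, start), refined_cost)
```

Because the starting temperature is low, few uphill moves are accepted, so the pass mostly polishes the seeded placement instead of scrambling it.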
18. Results
Initial placement results
Bounding box cost (bb cost) and delay cost are compared.
Clearly, the initial placement results generated using CS are much better than
VPR’s initial results, and are very close to VPR’s final results.
Figure: Comparisons of initial bb cost and of initial delay cost per benchmark (y-axis: percentage; series: CS-init, VPR-final, VPR-init).
CS reduces bb cost by 72% on avg. compared to VPR
CS reduces delay cost by 53% on avg. compared to VPR
19. Results (Cont.)
Post-routing results comparison
A low-temperature annealing is applied to the initial results.
Wire length, critical delay and area are compared.
The results demonstrate the effectiveness of the pruning techniques, which do not affect the quality significantly.
Figure: Wire length, area and critical delay per benchmark (series: CS-t, CS, VPR).
CS increases the wire length by 3% on avg.
CS increases the area by 2% on avg.
CS increases the crit. delay by 6% on avg.
20. Results (Cont.)
Runtime comparison
Only placement time is compared.
CS-t achieves 31x speedup on average, with up to 91x.
More speedup is expected when circuits become larger.
Figure: Speedups compared to VPR (y-axis: speedups; series: CS-t, CS, VPR).
21. Design Space Exploration
CAD flow
The logic-level and algorithm-
level design spaces are studied separately.
CSBP flow (a) and from-scratch
flow (b) are compared.
Settings
The logic-level design space consists of 19 configurations generated by 19 ABC¹ synthesis scripts in abc.rc.
The algorithm-level design space consists of 18 configurations of a constant multiplier generated by CMU SPIRAL [Puschel’04], varying bits from 7 to 25².
Both CS and CS-t are evaluated.
The benchmarking environments are the same as logic-level design space exploration.
Figure: CAD flow for design space exploration
¹ http://www.eecs.berkeley.edu/~alanmi/abc/
² Bit = 16 is abandoned due to ABC crash
23. Logic Level Results
Initial results comparison
The number of CLBs and levels vary widely in logic-level design space.
Circuit “dsip” is shown as an example.
Bounding box cost and delay cost are compared for initial placement results.
Figure: Initial bb cost of “dsip” per synthesis script (series: CS, CS-t, VPR).
CS reduces bb cost by 76% on avg.
Figure: Initial delay cost of “dsip” per synthesis script (y-axis: critical delay; series: CS, CS-t, VPR).
CS reduces delay cost by 48% on avg.
Figure: Characteristics of logic-level design space
24. Logic Level Results (Cont.)
Final placement results
Wire length and critical delay of circuit “dsip” are compared.
The final results produced by CS and CS-t are very close to or better
than VPR’s, with 32% overhead for wire length and 20%
improvement for critical delay.
Figure: Final wire length comparison of “dsip” and final critical delay comparison of “dsip” per synthesis script (y-axis: percentage; series: CS-t, CS, VPR).
25. Logic Level Results (Cont.)
Design space shape characterization
We compare the minimal, median and maximal wire length and critical delay produced by CS, CS-t and VPR.
We also compare the shapes of each configuration over 19 designs.
The almost identical curves show that CSBP is able to accurately depict the shape of a design space.
Figure: Shape of final wire length of circuit “dsip” (series: vpr, cs, cs-t).
Figure: Shape of minimal wire length of 20 circuits over 19 designs (series: vpr-min, cs-min, cs-t-min).
Figure: Shape of minimal crit. delay of 20 circuits over 19 designs (series: vpr-min, cs-min, cs-t-min).
26. Logic Level Results (Cont.)
Runtime comparison
Only placement time is compared.
CS-t achieves 30x speedup on
average, with up to 100x.
In practice, one can take
advantage of the significant
speedup of CS-t to perform quick
design space exploration.
Figure: Runtime comparison. Speedups compared to VPR (“*” marked time is measured with a timeout); y-axis: speedups; series: CS, CS-t, VPR.
27. Algorithm Level Results
Experimental settings
The algorithm-level design is a constant multiplier.
The design parameter explored in our experiments is the fractional bits varying from 7 to 25¹.
CMU SPIRAL is used to generate RTL design based on Hcub algorithm [Voronenko’07].
Figure: Characteristics of algorithm-level design space generated by CMU SPIRAL
Experimental results
The initial and final placement results are similar to logic-level space exploration.
CS and CS-t achieve 7x and 30x speedup compared to VPR, respectively.
Figure: An example of a constant parallel multiplier
¹ Bit = 16 is abandoned due to ABC crash
28. Algorithm Level Results (Cont.)
Wire length-delay space comparison
The pareto-points, which are the optimal configurations in a design space,
are of most interest to IC designers.
CS and VPR find the same pareto-points.
Bits = 24 is used as the reference circuit.
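A pareto-point is a configuration that no other configuration beats on both wire length and critical delay at once. The following is a minimal sketch of how such points can be picked out of a set of placed configurations. The function name and the numbers are made up for illustration and are not results from the talk.

```python
def pareto_points(designs):
    """Return the names of non-dominated designs (smaller is better).

    designs: dict name -> (wire_length, critical_delay)
    """
    front = []
    for name, (wl, delay) in designs.items():
        dominated = any(
            o_wl <= wl and o_delay <= delay and (o_wl < wl or o_delay < delay)
            for other, (o_wl, o_delay) in designs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical configurations: (wire length, critical delay in seconds).
designs = {
    "A": (100, 1.6e-7),
    "B": (120, 1.5e-7),
    "C": (150, 1.7e-7),  # worse than A on both axes
    "D": (90, 2.0e-7),
}
front = pareto_points(designs)
print(front)
```

Running the same filter on the results of a fast placer and of a reference placer, then comparing the two lists, is one way to check a claim such as "CS and VPR find the same pareto-points".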
Figure: Wire length-delay space of VPR and wire length-delay space of CS (x-axis: wire length; y-axis: estimated critical delay; points labeled by bit width, B7 to B25).
30. Future Work
Improvement to CSBP
Integrate predefined matchings, for example, naming matching, into our
CSBP to further enhance both the efficiency and the quality of the design.
Other applications
Study the effectiveness of applying circuit similarity algorithm to other
applications, such as routing and sequential verification for FPGAs
31. Conclusion
Proposed an efficient circuit similarity algorithm
Developed CSBP, a fast circuit similarity-based placement for
FPGAs
Applied CSBP to incremental design and design space exploration.
Open-source tool available at:
http://webdocs.cs.ualberta.ca/~xshi/soft.html
Applied CSBP to incremental design for FPGAs
CSBP is able to reduce engineering effort by capturing the similarity from the
previous design iterations.
CSBP is 31x faster compared to VPR.
Applied CSBP to design space exploration for FPGAs
CSBP can precisely depict the shape of a design space and pinpoint the
optimal designs.
CSBP is 30x faster compared to VPR.
32. Xiaoyu Shi, Dahua Zeng, Yu Hu, Guohui Lin, Osmar R. Zaiane
CSBP: A Fast Circuit Similarity-Based Placement for FPGA
Incremental Design and Design Space Exploration