The document presents research on improving the performance of wavelet transforms through lifting scheme cores. It introduces a lifting core as a processing unit that can continuously consume input and produce output while visiting each sample once in a cache-friendly manner. It discusses how lifting cores can handle borders, be configured for different processing orders, and allow reorganization of the underlying scheme for better parallelization and vectorization. The thesis aims to address shortcomings of prior methods through experimental evaluation of lifting cores on CPUs, GPUs, and FPGAs for 2D and 3D transforms as well as JPEG 2000 compression.
Lifting Scheme Cores for Wavelet Transform
1. Lifting Scheme Cores for Wavelet Transform
David Barina
(supervised by Pavel Zemcik)
1 / 24
2. DWT in image processing
the DWT can be found in many image-processing tasks:
analysis (edge detection, feature extraction, multiscale representation),
compression (JPEG 2000, Dirac),
watermarking, edge sharpening, contrast enhancement,
tone mapping, denoising, fusion, etc.
3. Filter bank
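S. Mallat, “A theory for multiresolution signal decomposition: The wavelet representation” (1989)
[Figure: two-channel filter bank. Analysis: filters $\tilde{H}(z^{-1})$ and $\tilde{G}(z^{-1})$, each followed by downsampling ($\downarrow 2$), produce $a$ and $d$. Synthesis: upsampling ($\uparrow 2$) followed by $H(z)$ and $G(z)$, with the two branches summed.]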
decomposition: two complementary filters,
high number of operations
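The analysis half of such a filter bank can be sketched as follows. This is only an illustration: the LeGall 5/3 filter taps, the periodic signal extension, and the names `filter_at` and `analyse` are assumptions, not taken from the slides.

```python
# Analysis side of a two-channel filter bank: filter, then keep every second sample.
# Assumptions (not from the slides): LeGall 5/3 analysis filters, periodic extension.

LOWPASS = [-1 / 8, 2 / 8, 6 / 8, 2 / 8, -1 / 8]  # evaluated at even positions
HIGHPASS = [-1 / 2, 1.0, -1 / 2]                 # evaluated at odd positions


def filter_at(signal, taps, centre):
    """Apply a symmetric FIR filter centred at index `centre` of a periodic signal."""
    half = len(taps) // 2
    n = len(signal)
    return sum(t * signal[(centre + k - half) % n] for k, t in enumerate(taps))


def analyse(signal):
    """Return (a, d): low-pass and high-pass outputs after downsampling by 2."""
    assert len(signal) % 2 == 0
    a = [filter_at(signal, LOWPASS, i) for i in range(0, len(signal), 2)]
    d = [filter_at(signal, HIGHPASS, i) for i in range(1, len(signal), 2)]
    return a, d
```

In this sketch every output pair costs eight multiplications (five low-pass taps plus three high-pass taps), which is the kind of overhead the slide refers to.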
4. Lifting scheme
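I. Daubechies, W. Sweldens, “Factoring wavelet transforms into lifting steps” (1998)
[Figure: lifting scheme. Decomposition: split, then $\tilde{P}(z^{-1})^T$, producing $a$ and $d$. Reconstruction: $P(z)$, then merge.]
$$P(z) = \left( \prod_{i=0}^{I-1} \begin{bmatrix} 1 & S_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ T_i(z) & 1 \end{bmatrix} \right) \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}$$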
decomposition: sequence of simple filtering steps,
reduces the number of operations, split: even, odd
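A minimal sketch of the split / lifting-steps / merge structure, assuming the LeGall 5/3 wavelet (a single pair of lifting steps and K = 1) and periodic extension. The function names are hypothetical, and neither the wavelet nor the extension is prescribed by the slide.

```python
def lifting_53(signal):
    """Forward LeGall 5/3 transform via lifting (periodic extension assumed)."""
    assert len(signal) % 2 == 0
    even = list(signal[0::2])  # split: even samples
    odd = list(signal[1::2])   # split: odd samples
    h = len(even)
    # first lifting step (predict): d[i] = odd[i] - (even[i] + even[i+1]) / 2
    d = [odd[i] - 0.5 * (even[i] + even[(i + 1) % h]) for i in range(h)]
    # second lifting step (update): a[i] = even[i] + (d[i-1] + d[i]) / 4
    a = [even[i] + 0.25 * (d[i - 1] + d[i]) for i in range(h)]
    return a, d


def inverse_lifting_53(a, d):
    """Undo the lifting steps in reverse order, then merge even and odd samples."""
    h = len(a)
    even = [a[i] - 0.25 * (d[i - 1] + d[i]) for i in range(h)]
    odd = [d[i] + 0.5 * (even[i] + even[(i + 1) % h]) for i in range(h)]
    out = [0.0] * (2 * h)
    out[0::2] = even
    out[1::2] = odd
    return out
```

Here each output pair needs only two multiplications, and the inverse is obtained by running the same steps backwards with flipped signs.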
6. 2-D decomposition
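S. Mallat, “A theory for multiresolution signal decomposition: The wavelet representation” (1989)
[Figure: 2-D decomposition. Panels: horizontal and vertical passes producing the subbands a, h, v, d; multi-scale layout with a decomposed again into a, h, v, d.]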
image: 2-D signal, by a series of 1-D transforms, four subbands,
multi-scale decomposition
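The separable construction (a 1-D transform along rows, then along columns) can be sketched as below. Haar lifting is used purely for brevity, and the subband naming (h for the horizontally high-passed band, v for the vertically high-passed band) is an assumption read off the "a h / v d" layout. All function names are hypothetical.

```python
def lift_1d(x):
    """One level of 1-D Haar lifting (chosen for brevity): returns (low, high)."""
    even, odd = x[0::2], x[1::2]
    high = [o - e for e, o in zip(even, odd)]        # predict
    low = [e + 0.5 * h for e, h in zip(even, high)]  # update
    return low, high


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def dwt_2d(image):
    """One decomposition level: horizontal pass on rows, vertical pass on columns."""
    rows = [lift_1d(row) for row in image]
    low = [lo for lo, _ in rows]    # left half after the horizontal pass
    high = [hi for _, hi in rows]   # right half after the horizontal pass

    def vertical(block):
        cols = [lift_1d(col) for col in transpose(block)]
        return transpose([lo for lo, _ in cols]), transpose([hi for _, hi in cols])

    a, v = vertical(low)   # approximation and vertical details
    h, d = vertical(high)  # horizontal and diagonal details
    return a, h, v, d
```

Calling `dwt_2d` again on the returned `a` gives the next level of a multi-scale decomposition.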
8. Strategies and issues
R. Kutil, “A single-loop approach to SIMD parallelization of 2-D wavelet lifting” (2006)
[Figure: subbands a, h, v, d after the horizontal and vertical passes]
strategies: row-column, block-based, and line-based
cache issues: cache line, limited size, set associativity, prefetching
techniques: padding, aggregation, memory layouts, interleave loops, parallelization
the approaches have to repeatedly visit samples,
memory access is expensive ⇒ CPU cache, limitations,
existing techniques, single-loop approach
9. Unsolved issues
[Figure: image divided into prolog, core, and epilog regions in both directions, processed in 2 × 2 units]
complicated border treatment (prolog/epilog phases)
suspend/resume processing
arbitrary processing order (scan order)
interleave the transform and a subsequent processing
multi-scale decomposition
reorganization of underlying scheme
10. Objectives of the thesis
Aims: improve image transform performance and resource consumption
Objectives: eliminate the shortcomings of existing methods (previous slide)
Evaluation: prove experimentally (performance, memory requirements)
11. Lifting core
D. Barina, P. Zemcik, “Vectorization and parallelization of 2-D wavelet lifting” (in press)
solution: a processing unit
which continuously consumes an input and produces an output
which visits every image sample only once (cache friendly)
which is aware of image coordinates (can handle the borders)
whose configuration (state) can be saved/restored
which can be run in any direction
which can be SIMD vectorized
which can run in parallel (on independent parts of the image)
$y = C\,x$, where $x \overset{\text{def}}{=} [\, I_n \;\; B \,]$ and $y \overset{\text{def}}{=} [\, O_n \;\; B \,]$
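To make the idea concrete, here is a minimal 1-D sketch of such a unit, assuming the LeGall 5/3 wavelet and symmetric border extension. It reads the relation $y = C\,x$ as "new input plus buffered state in, output plus updated state out", which is an interpretation rather than the authors' exact formulation. The class `Core53` and the helper `transform` are hypothetical names.

```python
class Core53:
    """Streaming 1-D LeGall 5/3 core: one input pair in, one output pair out."""

    def __init__(self):
        # Buffer: (previous even, previous odd, previous detail); None before start.
        self.state = None

    def step(self, even, odd):
        """Consume x[2n], x[2n+1]; return (a[n-1], d[n-1]), or None when n = 0."""
        out = None
        if self.state is not None:
            out = self._finish(next_even=even)
        prev_d = out[1] if out is not None else None
        self.state = (even, odd, prev_d)
        return out

    def _finish(self, next_even):
        e, o, prev_d = self.state
        d = o - 0.5 * (e + next_even)  # predict
        if prev_d is None:             # left border: mirror, d[-1] = d[0]
            prev_d = d
        a = e + 0.25 * (prev_d + d)    # update
        return a, d

    def flush(self):
        """Right border: mirror the last even sample, x[N] = x[N-2]."""
        return self._finish(next_even=self.state[0])


def transform(signal):
    """Drive the core over an even-length signal; every sample is read once."""
    core = Core53()
    pairs = [core.step(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    pairs = pairs[1:] + [core.flush()]
    return [a for a, _ in pairs], [d for _, d in pairs]
```

Because the whole configuration lives in `state`, processing can be suspended by copying that tuple and resumed later by assigning it to a fresh `Core53`.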
12. Core examples
D. Barina, P. Zemcik, “Vectorization and parallelization of 2-D wavelet lifting” (in press)
[Figure: core examples with lifting coefficients α, β, γ, δ, the labels m, n and 1 to 4; core inputs and outputs marked]
13. Processing orders
D. Barina, P. Zemcik, “Vectorization and parallelization of 2-D wavelet lifting” (in press)
[Figure: processing orders. Panels: horizontal, horizontal strips, horizontal blocks, vertical, vertical strips, vertical blocks.]
14. Borders treatment
D. Barina, P. Zemcik, “Vectorization and parallelization of 2-D wavelet lifting” (in press)
[Figure: border treatment. Rows of alternating d and a coefficients over coordinates 0, 2, ..., n, ..., N − 2, N, with a core placed at each position n.]
$y = C_n\,x$
the cores gracefully treat the boundaries
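One way to picture a coordinate-aware step is an index mapping that folds out-of-range coordinates back into the signal, so the same expression works at every position without a separate prolog or epilog. The sketch below assumes whole-sample symmetric extension and the LeGall 5/3 predict step. The names `mirror` and `detail_at` are hypothetical, and the authors' cores may handle borders differently.

```python
def mirror(i, n):
    """Map any integer index onto 0..n-1 by whole-sample symmetric extension."""
    period = 2 * (n - 1)
    i = abs(i) % period if period else 0
    return i if i < n else period - i


def detail_at(signal, n):
    """5/3 predict step at odd position 2n+1, valid at every coordinate."""
    size = len(signal)
    centre = 2 * n + 1
    left = signal[mirror(centre - 1, size)]
    right = signal[mirror(centre + 1, size)]  # folds back at the right border
    return signal[centre] - 0.5 * (left + right)
```

For a six-sample signal, index 6 folds back to 4 and index -1 folds to 1, so the last detail coefficient is computed from existing samples only.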
15. Parallel cores and reorganization
M. Kula, D. Barina, et al., “Block-based Approach to 2-D Wavelet Transform on GPUs” (2016)
[Figure: lifting schemes compared by number of steps. Sweldens1995: steps 1 to 4; Iwahashi2007: steps 1 to 3; proposed: steps 1 and 2.]
16. 3-D core
D. Barina, P. Zemcik, “Real-Time 3-D Wavelet Lifting” (2015)
[Figure: 3-D core with axes x, y, z and buffers x, y, z]
extended into more dimensions, buffers on the sides
17. CPU implementation
D. Barina, P. Zemcik, “Vectorization and parallelization of 2-D wavelet lifting” (in press)
[Plot: time per pixel (0 to 50 ns) versus number of pixels (1 k to 100 M). Series: separable approach, core approach.]
an evaluation of the approaches:
the separable, single-loop, and core approaches were implemented
18. 3-D CPU implementation
D. Barina, P. Zemcik, “Real-Time 3-D Wavelet Lifting” (2015)
[Figure: 3-D core with axes x, y, z and buffers x, y, z]
[Plot: time per voxel (0 to 160 ns) versus number of voxels (0 to 250 M). Series: naive horizontal, naive vertical, core $4^2$, core $2^3$, core $4^3$.]
performance of 3-D transform: separable, 2-D core, 3-D core
19. GPU implementation
M. Kula, D. Barina, et al., “Block-based Approach to 2-D Wavelet Transform on GPUs” (2016)
[Plot, left: throughput in GB/s (80 to 260) versus number of pixels (0 to 70 M). Series: Kucis2014, Separable Block, Non-Separable Block.]
[Plot, right: throughput in GB/s (0 to 60) versus image size (100 kpel to 100 Mpel). Series by scheme: Sweldens, Iwahashi*, Explosive*, Monolithic*, Polyphase*.]
left: SotA is in red, block methods in blue/green, reorganization
right: block methods, separable in black, ours in blue/green
20. FPGA implementation
D. Barina, et al., “Single-Loop Approach to 2-D Wavelet Lifting with JPEG 2000 Compatibility” (2015)
[Figure: FPGA architecture. Blocks: Input, H, V, BRAM, Transform.]

core        FF              LUT            BRAM
latency 4   441 (0.1 %)     399 (0.18 %)   6 (1.1 %)
latency 2   391 (< 0.1 %)   592 (0.27 %)   6 (1.1 %)

architecture   device                  BRAM [bits]   clocks/pel   time [ms]
Dillen2003     VirtexE1000-8           50K           0.50         1.20
Descampe2004   Virtex-II XC2V6000      N/A           0.60         1.75
Seo2007        Altera Stratix          128K          2.64         6.02
Zhang2012      Virtex-II Pro XC2VP30   6 × 18K       0.50         0.97
the cores      Zynq XC7Z045            1 × 36K       0.26         0.27
21. JPEG 2000 implementation
D. Barina, O. Klima, P. Zemcik, “Single-Loop Architecture for JPEG 2000” (2016)
[Figure: the core feeding code-blocks. Labels: core, codeblock, 2 × 2 $c_n$, 2 × 2 $c_m$, $a_j$, $a_{j+1}$, subbands h, v, d.]
[Plot: time in ns (0 to 140) versus resolution in pel (100 k to 1 G). Series: proposed, OpenJPEG, JasPer, FFmpeg.]
22. Contributions of the thesis
Aims: improved image transform performance and resource consumption
Objectives: eliminated the shortcomings of existing methods
Evaluation: assessed experimentally (performance, memory requirements)
evaluation performed: 2-D on CPU, 3-D on CPU, 2-D on GPU, 2-D on FPGA, JPEG 2000 on CPU
23. Selected papers
Barina, D.; Klima, O.; Zemcik, P.: Single-Loop Software Architecture for JPEG 2000. In Data Compression Conference (DCC), 2016
Barina, D.; Musil, M.; Musil, P.; et al.: Single-Loop Approach to 2-D Wavelet Lifting with JPEG 2000 Compatibility. In Workshop on Applications for Multi-Core Architectures (WAMCA), 2015
Barina, D.; Zemcik, P.: Minimum Memory Vectorisation of Wavelet Lifting. In Advanced Concepts for Intelligent Vision Systems (ACIVS), 2013
Barina, D.; Zemcik, P.: Wavelet Lifting on Application Specific Vector Processor. In GraphiCon, 2013
Barina, D.; Zemcik, P.: Diagonal Vectorisation of 2-D Wavelet Lifting. In IEEE International Conference on Image Processing (ICIP), 2014
Barina, D.; Zemcik, P.: Real-Time 3-D Wavelet Lifting. In International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), 2015
Barina, D.; Zemcik, P.: Vectorization and parallelization of 2-D wavelet lifting. Journal of Real-Time Image Processing (JRTIP), in press
Barina, D.; Klima, O.; Zemcik, P.: Single-Loop Architecture for JPEG 2000. In Image and Signal Processing (ICISP), 2016
Kula, M.; Barina, D.; Zemcik, P.: Block-based Approach to 2-D Wavelet Transform on GPUs. In International Conference on Information Technology – New Generations (ITNG), 2016
Kucis, M.; Barina, D.; Kula, M.; et al.: 2-D Discrete Wavelet Transform Using GPU. In Workshop on Applications for Multi-Core Architectures (WAMCA), 2014
24. Summary
the core:
a computing unit which processes the data in a single pass,
can suspend/resume execution,
can process the data in many different orders,
can handle signal boundaries (is aware of coordinates),
can be easily SIMD vectorized and parallelized,
and whose underlying scheme can be reorganized.