Cuda project paper

CUDA Speed Up for Side Information Generation In
Distributed Video Coding
Ping-Shang Wang
National Taiwan University
R98944043@ntu.edu.tw
Kan-Han Lu
National Taiwan Normal University
698470271@ntnu.edu.tw
Cong-Min Huang
National Taiwan University
R98548012@ntu.edu.tw
ABSTRACT
Distributed Video Coding (DVC) has become increasingly popular
among video coding researchers in recent years because of its
attractive and promising features. Compared with conventional
video codecs, DVC shifts the complexity balance between the
encoder and the decoder. However, most reported DVC schemes
suffer from a long decoding delay, which hinders their practical
application in real-time systems. In this work, we focus on
speeding up the Side Information (SI) generation module, which is
a major function in the DVC coding algorithm and one of the most
time-consuming parts of the decoder. We implement it with the
Compute Unified Device Architecture (CUDA) on a General-Purpose
Graphics Processing Unit (GPGPU), and the experimental results
show that the proposed parallelized SI generation algorithm
obtains a considerable speedup.
Keywords
Distributed video coding, Wyner-Ziv video coding, Side
information, frame interpolation, CUDA.
1. INTRODUCTION
Distributed video coding (DVC) has recently attracted a great
deal of attention from the video coding community worldwide. This
coding paradigm is also known as Wyner-Ziv (WZ) video coding and
is based on the Slepian-Wolf [1] and Wyner-Ziv [2] theorems.
These theorems state that separate encoding and joint decoding of
two correlated sources, X and Y, can achieve the same minimum
rate as the joint encoding and decoding used in conventional
video coding. A DVC codec departs from the traditional
prediction-based video coding scheme by exploiting the source
statistics at the decoder, which allows simpler encoders. That
is, the complexities of the encoder and the decoder are reversed:
the encoder becomes fairly simple and leaves all the
computationally expensive processing to the decoder. This is done
by shifting the complex procedure of motion
estimation/compensation from the encoder to the decoder. In
contrast to conventional coders, motion estimation is thus done
only at the decoder side. It is used to generate a
motion-compensated prediction Y, called side information (SI), of
the original frame X. The SI may be seen as a "corrupted" version
of the original information, and error correcting codes (LDPC or
Turbo codes) are typically used to improve the quality of the
side information until a target quality for the final decoded
frame is achieved.
These features are useful in several application domains, e.g.
video conferencing with mobile devices, wireless video cameras
and wireless low-power surveillance. However, most reported DVC
architectures face a common problem: high decoding complexity,
which prevents them from being used in real-time video
applications. The complexity arises mainly from two factors: the
iterative LDPC (or Turbo) decoding process with a feedback
channel, and the motion estimation procedure in Side Information
(SI) generation. To obtain a solution that is more suitable for
practical applications, new ideas have been proposed to amend or
optimize the structure of the decoder. However, SI generation
plays a key role in determining the performance of the codec, and
the reconstructed video quality is sensitive to the side
information. Reducing the cost of the motion search to generate
SI faster commonly causes a sharp decrease in PSNR. In addition,
a sufficient number of channel decoding iterations is needed to
guarantee bit accuracy, which is also critical to the video
quality. For these reasons, instead of dropping computing steps,
we adopt parallel approaches to achieve a faster implementation
that still performs the complete computation. As Graphics
Processing Units (GPUs) become more powerful and widespread, they
are finding broader application in scientific and general-purpose
computation. Our proposed parallel approach uses only a low-grade
NVIDIA GPU to significantly reduce the time needed for SI
generation while keeping the same SI quality.
The paper is organized as follows. Section 2 introduces the DVC
codec we used [3] and discusses its SI generation module. Section
3 introduces the proposed parallel approaches to SI generation.
Section 4 presents the experimental results. Finally, section 5
concludes the paper.
2. DISCOVER CODEC
Figure 1 - DISCOVER codec architecture

2.1 Introduction
The DISCOVER WZ video codec, developed by a European project
funded under the European Commission IST FP6 programme, is based
on the early Stanford WZ video coding architecture proposed in
[4, 5]; further information may be obtained from [3, 6]. Its
architecture is illustrated in Figure 1. The DISCOVER codec is
probably the most efficient WZ video codec now available. Its
performance is reported in detail, with the corresponding test
conditions, in [6]; moreover, executable code may be downloaded,
which allows all researchers to compare performance for other
sequences and conditions as well.
In this work, we parallelized only the SI generation module
using GPGPU; the remaining modules are unchanged. Given the scope
of this work, we describe only the SI generation module. For
details of the other modules, see [5].
2.2 SI Generation Module in DISCOVER Codec
The following techniques [7][8] are used to obtain high quality
side information. Fig. 2 shows the architecture proposed for the
frame interpolation scheme. First, forward motion estimation from
Xb to Xf is performed. A block matching based on a modified
MAD (mean absolute difference) criterion is used in order to
regularize the motion vector field, which favors motion vectors
closer to the origin. Then, bidirectional motion estimation is
performed in order to find symmetric motion vectors from the
current WZ frame to Xb and Xf. Spatial motion smoothing based
on a weighted vector median filter is applied afterwards to the
obtained motion field to remove outliers. Finally, motion
compensation is performed between Xb and Xf along the obtained
motion field, so as to generate the side information. A hierarchical
coarse-to-fine approach is used in the bidirectional motion
estimation: the first iteration corresponds to a large block size
(16×16) and tracks fast motion reliably, while the second iteration
achieves higher precision using a smaller block size (8×8). The
motion search is performed using the half-pixel precision method
described in [9].
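As a rough illustration of the regularized block matching described above, the following Python sketch scales the MAD of a candidate block by a factor that grows with the length of its motion vector, so that vectors closer to the origin are favored. The multiplicative form of the penalty, the constant k = 0.05, and the names modified_mad and best_vector are assumptions made for this sketch. The text above does not give the exact criterion, and the integer-pixel full search shown here omits the half-pixel refinement.

```python
import math

def modified_mad(cur_block, ref_block, dx, dy, k=0.05):
    """MAD of two equal-size blocks, scaled by a penalty on the vector length.

    The penalty form and k are illustrative assumptions, not the codec's values.
    """
    n = len(cur_block) * len(cur_block[0])
    sad = sum(abs(c - r)
              for crow, rrow in zip(cur_block, ref_block)
              for c, r in zip(crow, rrow))
    return (sad / n) * (1.0 + k * math.hypot(dx, dy))

def best_vector(cur, ref, bx, by, size, search):
    """Full search around (bx, by); returns (cost, dx, dy) of the cheapest candidate."""
    h, w = len(ref), len(ref[0])
    cur_block = [row[bx:bx + size] for row in cur[by:by + size]]
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + size > w or y + size > h:
                continue  # candidate falls outside the reference frame
            ref_block = [row[x:x + size] for row in ref[y:y + size]]
            cand = (modified_mad(cur_block, ref_block, dx, dy), dx, dy)
            if best is None or cand[0] < best[0]:
                best = cand
    return best

# Toy frames: cur is ref shifted left by one pixel, so the true vector is (1, 0).
ref = [[x * x + 3 * y for x in range(8)] for y in range(8)]
cur = [[ref[y][x + 1] if x < 7 else 0 for x in range(8)] for y in range(8)]
result = best_vector(cur, ref, 2, 2, 2, 2)

# Two candidates with the same MAD: the longer vector costs more.
a = [[10, 10], [10, 10]]
b = [[11, 11], [11, 11]]
short_cost = modified_mad(a, b, 0, 0)
long_cost = modified_mad(a, b, 3, 4)
```

With equal MAD, the candidate at (3, 4) is penalized relative to the one at the origin, which is the regularizing effect the criterion is meant to have.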
Among the processes mentioned above, the most time-consuming
steps are forward motion estimation and FIR filtering (for
half-pixel motion estimation), which account for about 70% and
25% of the processing time of the entire procedure, respectively.
Consequently, we focused on parallelizing these two parts using
GPGPU to reduce the processing time.
3. PARALLEL APPROACH TO SI GENERATION ON GPGPU PLATFORM
3.1 Parallelized Forward Motion Estimation
The proposed parallel algorithm for forward motion estimation is
implemented on the GPGPU platform using CUDA. We parallelize this
part at the block level to incur the least thread overhead. To
balance the workload across cores, we also use the indexes of the
WZ blocks gathered before decoding and allocate almost the same
number of blocks to each core. For simplicity, we illustrate the
proposed approach with an example: the input sequence is QCIF
(176x144) and the block size is 16x16, which gives 99 blocks in
the target frame.
The parallel processing of forward motion estimation is shown in
Figure 4. First, we launch a CUDA kernel to compute the motion
field between the past frame and the future frame. Each block in
the future frame is mapped to its own CUDA block and has
1024-4096 candidate blocks within the search range in the
reference frame. We use 512 threads per CUDA block to compute the
cost (modified MAD) of the candidate blocks in parallel. Each
thread keeps its local minimum cost (among all the candidate
blocks it has processed) and the corresponding motion vector in
shared memory. Hence, when all threads of a CUDA block have
finished, we have 512 local minimum costs, and we pick the global
minimum among them with the reduction algorithm described in
[10] (Figure 3). When the kernel finishes, the resulting motion
field is kept in device memory. A second CUDA kernel then finds,
for each block in the interpolated frame, the corresponding
motion vector that is closest to the origin of that block, and
the result is copied back to the host.
The reference frame is transferred to the device and stored in
the GPU's texture memory, which is faster to access when the same
position is read multiple times. Each 16x16 block of the future
frame is stored in shared memory so that every thread in the CUDA
block can access it quickly. The local minimum cost and the
corresponding motion vector of each thread are stored in shared
memory for the same reason. In addition, we unroll loops, avoid
bank conflicts in shared memory, and minimize the number of
global memory accesses.
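The two stages described above, per-thread local minima followed by a tree reduction, can be emulated sequentially to make the data flow concrete. The sketch below is plain Python, not the CUDA kernel used in this work. The strided assignment of candidates to threads, the function names, and the synthetic cost values are assumptions for illustration. Each inner loop stands in for threads that would run in parallel on the GPU.

```python
def local_minima(costs, num_threads):
    """Thread t scans candidates t, t + num_threads, ... and keeps its own (cost, index) minimum."""
    empty = (float("inf"), -1)  # slot value for a thread that received no candidate
    return [min(((costs[i], i) for i in range(t, len(costs), num_threads)),
                default=empty)
            for t in range(num_threads)]

def tree_reduce_min(slots):
    """Halve the active range each step; slot i keeps the smaller of slots i and i + stride."""
    shared = list(slots)           # stands in for the shared-memory array
    stride = len(shared) // 2      # len(slots) is assumed to be a power of two
    while stride > 0:
        for i in range(stride):    # on a GPU these iterations are parallel threads
            if shared[i + stride] < shared[i]:
                shared[i] = shared[i + stride]
        stride //= 2               # a barrier would separate consecutive steps
    return shared[0]

# QCIF example from the text: 176x144 pixels in 16x16 blocks.
blocks = (176 // 16) * (144 // 16)

# Synthetic costs for 4096 candidates of one block, with a unique minimum at index 1234.
costs = [(i * 37) % 4096 + 1 for i in range(4096)]
costs[1234] = 0
per_thread = 4096 // 512           # candidates handled by each of the 512 threads
best = tree_reduce_min(local_minima(costs, 512))
small = tree_reduce_min(local_minima([5, 3, 9], 4))
```

With 4096 candidates and 512 threads, each thread handles 8 candidates, and the reduction over 512 slots takes 9 halving steps.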
Figure 3 – Reduction Algorithm

3.2 Parallelized FIR Filter
The FIR filter used [9] reads several neighboring pixel locations
to interpolate each output pixel and is therefore
computationally expensive. Because every interpolated pixel can
be computed independently of the others, we improved the filter
performance by parallelizing the upsampling process at the pixel
level.
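The pixel-level independence can be sketched as follows. The taps below are the six-tap half-sample filter of H.264/AVC, assumed here purely for illustration, since the coefficients of the filter in [9] are not given in this paper. The border clamping, the rounding, and the names half_pel_sample and upsample_row are likewise assumptions. The sketch handles one row only.

```python
TAPS = (1, -5, 20, 20, -5, 1)   # assumed half-sample taps (sum = 32), not taken from [9]

def half_pel_sample(row, x):
    """Half-pixel value between row[x] and row[x + 1].

    Reads six neighbouring input pixels and writes nothing shared, so every
    call could be executed by a separate GPU thread.
    """
    n = len(row)
    acc = sum(t * row[min(max(x - 2 + k, 0), n - 1)]   # clamp reads at the borders
              for k, t in enumerate(TAPS))
    return min(255, max(0, (acc + 16) >> 5))           # round, divide by 32, clip to 8 bits

def upsample_row(row):
    """Interleave integer pixels (even positions) with half-pixel samples (odd positions)."""
    out = []
    for x in range(len(row)):
        out.append(row[x])
        if x + 1 < len(row):
            out.append(half_pel_sample(row, x))
    return out

flat = upsample_row([100] * 8)
ramp = upsample_row([0, 10, 20, 30, 40, 50, 60, 70])
```

A flat row stays flat after upsampling, and on the ramp the sample between 20 and 30 evaluates to 25.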
4. EXPERIMENTAL RESULTS
All evaluations are run on a PC with an AMD Athlon 64 X2 5600+
CPU at 2.91GHz (1MB cache) and an NVIDIA GeForce GT220 graphics
card. The test sequences are Foreman, Soccer, Coastguard and Hall
Monitor at QCIF resolution and a 15 Hz frame rate, and each
sequence is processed in full. The GOP size is 2 and the 8th
quantization table (Q=8) is used. To focus on the performance of
the SI generation module, the times presented here include only
the processing time of each component of SI generation. All
times are reported in milliseconds, as in Table 1.
The SI generation processing time for all test sequences is given
in Table 1. On average, we achieve a speedup of 14.15 times for
forward motion estimation and 6.87 times for the FIR filter. For
the entire SI generation procedure, the average speedup is 9.46
times.
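The averages quoted above can be reproduced from the per-sequence times in Table 1. The short Python check below copies the CPU and GPGPU times from the table, computes each ratio rounded to two decimals as in the table, and averages the four ratios. The helper name mean_speedup is introduced only for this check.

```python
# (CPU ms, GPGPU ms) per sequence, copied from Table 1
forward_me = {"Foreman": (8850, 594), "Soccer": (8911, 593),
              "Coastguard": (8769, 705), "Hall Monitor": (9702, 682)}
fir_filter = {"Foreman": (1228, 178), "Soccer": (1173, 173),
              "Coastguard": (1267, 181), "Hall Monitor": (1386, 204)}
total = {"Foreman": (10384, 1083), "Soccer": (10350, 1031),
         "Coastguard": (10330, 1180), "Hall Monitor": (11410, 1205)}

def mean_speedup(times):
    """Average of the per-sequence CPU/GPGPU ratios, each rounded to two decimals."""
    ratios = [round(cpu / gpu, 2) for cpu, gpu in times.values()]
    return round(sum(ratios) / len(ratios), 2)

me_avg = mean_speedup(forward_me)
fir_avg = mean_speedup(fir_filter)
total_avg = mean_speedup(total)
best_total = max(round(cpu / gpu, 2) for cpu, gpu in total.values())
```

The largest whole-module speedup among the four sequences is the one for Soccer.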
We also tested the algorithm on a PC with a less powerful CPU and
an NVIDIA graphics card of the same grade, which gave an even
higher speedup (20-24 times); those results are not reported
here. The measured speedup therefore depends strongly on the
relative power of the CPU and the GPU.
5. CONCLUSIONS
In this paper, we proposed a parallel algorithm for SI generation
in distributed video coding, based on GPGPU and implemented with
NVIDIA CUDA. To achieve load balancing and optimal runtime, we
presented a dynamic distribution scheme based on a task tool
model and threshold searching method. Experimental results
demonstrate that our algorithm runs the side information module
up to 10 times faster than sequential processing, and 9.46 times
faster on average.
6. REFERENCES
[1] D. Slepian and J. Wolf, "Noiseless coding of correlated
information sources," IEEE Trans. Inf. Theory, vol. 19, no. 4,
pp. 471-480, 1973.
[2] A. D. Wyner and J. Ziv, "The rate-distortion function for
source coding with side information at the decoder," IEEE
Trans. Inf. Theory, vol. 22, pp. 1-10, 1976.
Figure 4 - Parallel approach for forward motion estimation
[Figure labels: blocks B1 to B99 on a 176/16 x 144/16 grid, 4096
candidates per block, kernel, shared memory, future frame]

[3] X. Artigas, J. Ascenso, M. Dalai, S. Klomp, D. Kubasov and
M. Ouaret, "The DISCOVER codec: Architecture, techniques
and evaluation," Nov. 2007.
[4] A. Aaron, S. Rane, E. Setton and B. Girod, "Transform-Domain
Wyner-Ziv Codec for Video," Visual Communications and Image
Processing, San Jose, CA, January 2004.
[5] B. Girod, A. Aaron, S. Rane and D. Rebollo-Monedero,
"Distributed Video Coding," Proceedings of the IEEE, vol. 93,
no. 1, pp. 71-83, January 2005.
[6] DISCOVER Page,
http://www.img.lx.it.pt/~discover/home.html
[7] J. Ascenso, C. Brites and F. Pereira, "Content Adaptive
Wyner-Ziv Video Coding Driven by Motion Activity," IEEE
International Conference on Image Processing, Atlanta, USA,
October 2006.
[8] J. Ascenso, C. Brites and F. Pereira, "Improving Frame
Interpolation with Spatial Motion Smoothing for Pixel
Domain Distributed Video Coding," 5th EURASIP
Conference on Speech and Image Processing, Multimedia
Communications and Services, Smolenice, Slovak Republic,
July 2005.
[9] S. Klomp, Y. Vatis and J. Ostermann, "Side Information
Interpolation with Sub-pel Motion Compensation for Wyner-Ziv
Decoder," Int. Conf. on Signal Processing and
Multimedia Applications, Setúbal, Portugal, August 2006.
[10] M. Harris, "Optimizing parallel reduction in CUDA,"
NVIDIA Developer Technology, 2007.
Table 1. SI generation time for test sequences

SI Generation Time (ms)      CPU              GPGPU           CPU/GPGPU

Foreman (74 WZ frames, 76 key frames)
  FIR filter                  1228 (11.8%)     178 (16.5%)      6.90
  Forward ME                  8850 (85.2%)     594 (54.8%)     14.90
  Others                       306 (3%)        311 (28.7%)       -
  Total                      10384 (100%)     1083 (100%)       9.59
  Average (per WZ frame)     140.32            14.64            9.59

Soccer (74 WZ frames, 76 key frames)
  FIR filter                  1173 (11.3%)     173 (16.8%)      6.78
  Forward ME                  8911 (86.0%)     593 (57.5%)     15.03
  Others                       266 (2.7%)      265 (25.7%)       -
  Total                      10350 (100%)     1031 (100%)      10.04

Coastguard (74 WZ frames, 76 key frames)
  FIR filter                  1267 (12.3%)     181 (15.3%)      7.00
  Forward ME                  8769 (84.9%)     705 (59.7%)     12.44
  Others                       294 (2.8%)      294 (25.0%)       -
  Total                      10330 (100%)     1180 (100%)       8.75

Hall Monitor (81 WZ frames, 83 key frames)
  FIR filter                  1386 (12.2%)     204 (16.9%)      6.79
  Forward ME                  9702 (85.0%)     682 (56.6%)     14.23
  Others                       322 (2.8%)      319 (26.5%)       -
  Total                      11410 (100%)     1205 (100%)       9.47
