Connected Component Labeling on Intel Xeon Phi Coprocessors – Parallelization and Vectorization

Florian Wende
Zuse-Institute Berlin
Connected Component Labeling on Xeon Phi: Parallelization & Vectorization

wende@zib.de | Connected Component Labeling on Xeon Phi | ISC13, Leipzig
Connected Component Labeling
Suppose we are given the following image . . .

. . . and we are to assign unique labels to different connected regions!
. . . In parallel?
- Computer Vision: detect connected regions in images
- Computational Physics: cluster algorithms for the Ising model
- Percolation Theory
How to achieve the labeling? . . .

1. Labeling algorithm
2. Parallelization
   a. Parallel implementation on CPU
   b. Run the CPU code on the Xeon Phi
   c. Adapt the code for the Xeon Phi
3. Vectorization (SIMD)
   d. Leave it to the compiler (auto-vectorization)
   e. SIMD intrinsic functions
Xeon Phi: 512-bit SIMD unit, i.e. 16 x 32-bit words per vector
Connected Component Labeling - Strategy

- Breadth/depth-first search algorithms, multi-pass algorithms
- Hoshen-Kopelman algorithm
- Cluster self-labeling algorithm by Coddington and Baillie:
  1. Assign a unique label to each pixel of the image.
  2. For each pixel, consider its adjacent connected pixels in positive 1-, 2-, . . . direction, and set the labels of each such pair to their minimum value.
  3. If the minimum operation changes no label for any pixel, the labeling is finished. Otherwise, continue with step 2.
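The three self-labeling steps above can be sketched in scalar C++. This is only an illustration under assumptions not stated on the slide: a 2D image in row-major order, no periodic boundaries, and two adjacent pixels counting as connected when their colors are equal. The function name `self_label` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar sketch of cluster self-labeling (assumed: 2D, row-major, no
// periodic boundaries, "connected" means equal color). Names are illustrative.
std::vector<int> self_label(const std::vector<int>& color,
                            std::size_t width, std::size_t height) {
    assert(color.size() == width * height);

    // Step 1: every pixel starts with a unique label (its array index).
    std::vector<int> label(color.size());
    for (std::size_t i = 0; i < label.size(); ++i) label[i] = static_cast<int>(i);

    bool changed = true;
    while (changed) {  // Step 3: repeat until no label changes.
        changed = false;
        for (std::size_t y = 0; y < height; ++y) {
            for (std::size_t x = 0; x < width; ++x) {
                const std::size_t i = y * width + x;
                // Step 2: neighbors in positive 1-direction (x+1) and
                // positive 2-direction (y+1).
                const std::size_t nb[2] = {i + 1, i + width};
                const bool valid[2] = {x + 1 < width, y + 1 < height};
                for (int d = 0; d < 2; ++d) {
                    if (!valid[d] || color[i] != color[nb[d]]) continue;
                    const int m = std::min(label[i], label[nb[d]]);
                    if (label[i] != m || label[nb[d]] != m) {
                        label[i] = label[nb[d]] = m;
                        changed = true;
                    }
                }
            }
        }
    }
    return label;
}
```

At the fixed point every connected region carries the smallest array index it contains, so labels of different regions cannot collide.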
CPU: Hoshen-Kopelman
Xeon Phi: Hoshen-Kopelman vs. Cluster self-labeling
Connected Component Labeling - Algorithm

Partition the image into equal-sized sub-images, and label them
independently using multiple threads
- Unique labels across different sub-images
- Connected regions that extend over multiple sub-images are merged after the labeling using atomic primitives
[Figure: the image partitioned into eight sub-images, assigned to Thread 0 through Thread 7]
Connected Comp. Labeling - Parallelization
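One possible way to merge regions across sub-image borders with atomic primitives is a lock-free union of label trees using compare-and-swap. The sketch below is an assumption about how such a merge could look, not the implementation used in the talk. It assumes each sub-image was already labeled with global array indices, so that every region has one root pixel whose label equals its own index. `find_root`, `merge`, `merge_sub_images`, `tile_w` and `tile_h` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <utility>
#include <vector>

using Labels = std::vector<std::atomic<int>>;

// Follow labels until reaching a pixel that points to itself (the root).
int find_root(const Labels& label, int i) {
    int parent = label[i].load();
    while (parent != i) {
        i = parent;
        parent = label[i].load();
    }
    return i;
}

// Join the regions containing pixels a and b; the smaller root index wins.
void merge(Labels& label, int a, int b) {
    for (;;) {
        a = find_root(label, a);
        b = find_root(label, b);
        if (a == b) return;
        if (a > b) std::swap(a, b);
        int expected = b;  // b must still be a root for the update to apply
        if (label[b].compare_exchange_strong(expected, a)) return;
        // Another thread re-linked b in the meantime: retry.
    }
}

// Merge equal-colored pixel pairs that straddle a sub-image border, then
// replace every label by its root. tile_w/tile_h are the sub-image extents.
void merge_sub_images(Labels& label, const std::vector<int>& color,
                      int width, int height, int tile_w, int tile_h) {
    assert(tile_w > 0 && tile_h > 0);
    #pragma omp parallel for
    for (int i = 0; i < width * height; ++i) {
        const int x = i % width, y = i / width;
        // Right neighbor lies in the next sub-image column.
        if ((x + 1) % tile_w == 0 && x + 1 < width && color[i] == color[i + 1])
            merge(label, i, i + 1);
        // Lower neighbor lies in the next sub-image row.
        if ((y + 1) % tile_h == 0 && y + 1 < height && color[i] == color[i + width])
            merge(label, i, i + width);
    }
    #pragma omp parallel for
    for (int i = 0; i < width * height; ++i)
        label[i].store(find_root(label, i));
}
```

Because a root is only ever re-linked to a smaller index, the label trees stay acyclic even when several threads merge concurrently. Without OpenMP enabled the pragmas are ignored and the code runs serially.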

Example: Self-labeling within sub-image of thread 2
- Process multiple data elements simultaneously using SIMD instructions
Connected Comp. Labeling - Vectorization

1. Initialize the labeling (array index).
2. Load row[0] into reg0, and create a mask for adjacent entries in positive 1-direction: 1 if equal-colored, 0 otherwise.
3. Overlap each element in reg0 with its adjacent element in positive 1-direction, and write the result to reg1.
4. Determine the pairwise minimum of the entries in reg0 and reg1 using the mask, and write the result to reg1.
5. Write back the entries in reg1 to row[0] using the mask.
6. Shift all elements in reg1 one position in positive 1-direction, shifting in the 0-th element, and write the result to reg1.
7. Shift all bits in the mask one position up, and write the pairwise minimum of the entries in row[0] and reg1 to row[0] using the shifted mask.
8. Did labels change?
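To make the data movement in steps 2 to 8 concrete, the following scalar C++ sketch emulates one 16-lane register with a `std::array` and the mask with a 16-bit integer. It is not SIMD code and not the author's implementation. It assumes a single 16-element row without periodic wrap-around, so mask bit 15 stays 0, and it models the shift in step 6 as a rotation whose wrapped-in lane is never used. On hardware, the masked steps would presumably map to `_mm512_mask_*` intrinsics such as a masked minimum. `Reg`, `kLanes` and `sweep_row` are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 16;            // 512 bit / 32 bit
using Reg = std::array<int, kLanes>;  // stands in for one SIMD register

// One sweep in positive 1-direction over a single 16-element row.
// Returns true if any label changed (step 8).
bool sweep_row(const Reg& color, Reg& row) {
    const Reg before = row;

    // Step 2: load the labels and build the neighbor mask. Lane 15 has no
    // neighbor inside the register, so its bit stays 0 (assumption).
    const Reg reg0 = row;
    std::uint16_t mask = 0;
    for (int i = 0; i + 1 < kLanes; ++i)
        if (color[i] == color[i + 1]) mask |= std::uint16_t(1u << i);

    // Step 3: lane i receives the label of its neighbor i+1.
    Reg reg1{};
    for (int i = 0; i < kLanes; ++i) reg1[i] = reg0[(i + 1) % kLanes];

    // Step 4: masked pairwise minimum of reg0 and reg1, kept in reg1.
    for (int i = 0; i < kLanes; ++i)
        if ((mask >> i) & 1) reg1[i] = std::min(reg0[i], reg1[i]);

    // Step 5: masked write-back to the row.
    for (int i = 0; i < kLanes; ++i)
        if ((mask >> i) & 1) row[i] = reg1[i];

    // Step 6: move every element one lane up, so the minimum computed for
    // the pair (i, i+1) now sits in lane i+1 (modeled as a rotation).
    std::rotate(reg1.begin(), reg1.end() - 1, reg1.end());

    // Step 7: shift the mask up and apply the masked minimum to the row.
    const std::uint16_t shifted = std::uint16_t(mask << 1);
    for (int i = 0; i < kLanes; ++i)
        if ((shifted >> i) & 1) row[i] = std::min(row[i], reg1[i]);

    // Step 8: did labels change?
    return row != before;
}
```

A single sweep moves the minimum label by only one lane, so a run of k equal-colored elements needs several sweeps before all of its lanes agree. This is why step 8 feeds back into a loop.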

Result of the operations up to now: adjacent connected elements in row[0] are set to their pairwise minimum value.
[Figure: row[0] before and after the update]
Repeat the procedure for the 2-direction.
[Figure labels: 1-direction, 2-direction]

Repeat the procedure for all other rows as long as labels change . . .
[Figure: labels before and after]
Now: Merge labels across different sub-images using atomics!
Finished!

CPU: Xeon E5-2670, 8 cores + 2-way Hyper-Threading @ 2.6 GHz
- Hoshen-Kopelman algorithm + atomics for label merging
- Vectorization was left to the compiler: there are no masked SIMD intrinsics!
Xeon Phi: 60 cores + 4-way Hyper-Threading @ 1.1 GHz
- Hoshen-Kopelman vs. cluster self-labeling + atomics for label merging
- Vectorization by means of _mm512_[mask]_XXX() intrinsics
Parallelization by means of OpenMP: #pragma omp parallel {...}
Programming effort:
- approx. 2-3 days for the CPU code (incl. optimization)
- less than 1 day for the Xeon Phi code (based on the CPU code)
Connected Comp. Labeling - Benchmark
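As a rough illustration of the OpenMP parallelization mentioned above, the sketch below labels equal-sized sub-images independently, one loop iteration per sub-image. It is a minimal sketch under assumptions: scalar self-labeling instead of Hoshen-Kopelman or SIMD code, image dimensions divisible by the tile counts, and hypothetical names (`label_tile`, `label_sub_images`, `tiles_x`, `tiles_y`). Merging across sub-image borders is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Self-label one sub-image [x0,x1) x [y0,y1) in place. Labels are global
// array indices, so they are unique across sub-images by construction.
void label_tile(const std::vector<int>& color, std::vector<int>& label,
                int width, int x0, int x1, int y0, int y1) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (int y = y0; y < y1; ++y)
            for (int x = x0; x < x1; ++x) {
                const int i = y * width + x;
                const int nb[2] = {i + 1, i + width};
                const bool inside[2] = {x + 1 < x1, y + 1 < y1};
                for (int d = 0; d < 2; ++d) {
                    if (!inside[d] || color[i] != color[nb[d]]) continue;
                    const int m = std::min(label[i], label[nb[d]]);
                    changed |= (label[i] != m) || (label[nb[d]] != m);
                    label[i] = label[nb[d]] = m;
                }
            }
    }
}

std::vector<int> label_sub_images(const std::vector<int>& color, int width,
                                  int height, int tiles_x, int tiles_y) {
    assert(color.size() == static_cast<std::size_t>(width) * height);
    assert(width % tiles_x == 0 && height % tiles_y == 0);
    const int tw = width / tiles_x, th = height / tiles_y;
    std::vector<int> label(color.size());
    for (int i = 0; i < width * height; ++i) label[i] = i;

    // Each iteration touches only its own sub-image, so no synchronization
    // is needed in this phase.
    #pragma omp parallel for
    for (int t = 0; t < tiles_x * tiles_y; ++t) {
        const int x0 = (t % tiles_x) * tw, y0 = (t / tiles_x) * th;
        label_tile(color, label, width, x0, x0 + tw, y0, y0 + th);
    }
    return label;
}
```

For example, `label_sub_images(color, 4, 2, 2, 1)` splits a 4 x 2 image into a left and a right 2 x 2 sub-image. Regions that touch across the border keep different labels until a separate merge step joins them.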

Application: Swendsen-Wang cluster algorithm for the 2D Ising model

Work partially funded by BMBF Grant No. 01IH11004G
Dr. Thomas Steinke, Zuse-Institute Berlin (ZIB)
Dr. Michael Klemm, Intel GmbH, Germany
Acknowledgement

[1] C. F. Baillie and P. D. Coddington. Cluster Identification Algorithms for Spin Models – Sequential and Parallel. 1991.
[2] J. Hoshen and R. Kopelman. Percolation and Cluster Distribution. I. Cluster Multiple Labeling Technique and Critical Concentration Algorithm. Phys. Rev. B, 14:3438–3445, 1976.
[3] R. H. Swendsen and J.-S. Wang. Nonuniversal Critical Dynamics in Monte Carlo Simulations. Phys. Rev. Lett., 58:86–88, Jan 1987.
[4] Intel Corp. Intel Xeon Phi Coprocessor 5110P, Product Brief. 2012.
References
