2. wende@zib.de Connected Component Labeling on Xeon Phi, ISC13, Leipzig
Connected Component Labeling
Suppose we are given the following image . . .
3. . . . and we are to assign unique labels to different connected regions!
. . . In parallel?
Computer Vision
  Detect connected regions in images
Computational Physics
  Cluster algorithms for the Ising model
  Percolation theory
How to achieve the labeling? . . .
5. Connected Component Labeling - Strategy
1. Labeling algorithm
2. Parallelization
   a. Parallel implementation on CPU
   b. Run the CPU code on the Xeon Phi
   c. Adapt the code for the Xeon Phi
3. Vectorization (SIMD)
   d. Leave it to the compiler (auto-vectorization)
   e. SIMD intrinsic functions
Xeon Phi: 512-bit SIMD unit for 16 x 32-bit words
6. Connected Component Labeling - Algorithm
Breadth-first/depth-first search algorithms, multi-pass algorithms
Hoshen-Kopelman algorithm
Cluster self-labeling algorithm by Coddington and Baillie
1. Assign a unique label to each pixel of the image.
2. For each pixel, consider its adjacent connected pixels in the positive 1-, 2-, ... direction, and set the labels of the pixel and of each such neighbor to the minimum of the two.
3. If the minimum operation left the labels of all pixels unchanged (it acted as the identity function): Finished! Otherwise: continue with step 2.
CPU: Hoshen-Kopelman
Xeon Phi: Hoshen-Kopelman vs. cluster self-labeling
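The three self-labeling steps can be sketched in scalar form as follows. This is a minimal Python sketch under stated assumptions, not the code from the talk: the image is a 2D list of rows, two pixels count as connected when they are direct neighbors with equal values, and the initial label is the flattened array index. The function name self_label is hypothetical.

```python
def self_label(image):
    """Label connected equal-valued regions of a 2D grid (list of rows)."""
    height, width = len(image), len(image[0])
    # Step 1: unique initial label per pixel (its flattened array index).
    labels = [[y * width + x for x in range(width)] for y in range(height)]
    changed = True
    while changed:
        changed = False
        # Step 2: look only at neighbors in the positive 1- and 2-direction.
        for y in range(height):
            for x in range(width):
                for ny, nx in ((y, x + 1), (y + 1, x)):
                    if ny < height and nx < width and image[ny][nx] == image[y][x]:
                        low = min(labels[y][x], labels[ny][nx])
                        if labels[y][x] != low or labels[ny][nx] != low:
                            labels[y][x] = labels[ny][nx] = low
                            changed = True
        # Step 3: the loop ends once a full sweep changes no label.
    return labels
```

Each region ends up carrying the smallest array index it contains, so labels of different regions never collide.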
8. Connected Comp. Labeling - Parallelization
Partition the image into equal-sized sub-images, and label them independently using multiple threads.
Labels are unique across different sub-images.
Connected regions that extend over multiple sub-images are merged after the labeling using atomic primitives.
[Figure: image partitioned into eight sub-images, one per thread (Thread 0 to Thread 7)]
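The partition-then-merge idea can be sketched as follows. This is a simplified Python sketch under several assumptions that are not from the talk: the sub-images are horizontal strips rather than a 2D grid of tiles, Python threads only illustrate the structure (they give no real speedup here), and the merge step is a sequential union-find pass standing in for the atomic primitives mentioned above. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def label_strip(image, labels, y0, y1):
    """Self-label rows y0..y1-1 only; never looks across the strip border."""
    width = len(image[0])
    changed = True
    while changed:
        changed = False
        for y in range(y0, y1):
            for x in range(width):
                for ny, nx in ((y, x + 1), (y + 1, x)):
                    if ny < y1 and nx < width and image[ny][nx] == image[y][x]:
                        low = min(labels[y][x], labels[ny][nx])
                        if labels[y][x] != low or labels[ny][nx] != low:
                            labels[y][x] = labels[ny][nx] = low
                            changed = True


def find(parent, label):
    """Return the root label of the merged set that contains `label`."""
    while parent[label] != label:
        parent[label] = parent[parent[label]]  # path halving
        label = parent[label]
    return label


def label_parallel(image, num_strips):
    height, width = len(image), len(image[0])
    if not 1 <= num_strips <= height:
        raise ValueError("num_strips must be between 1 and the image height")
    # Global array indices as initial labels: unique across all strips.
    labels = [[y * width + x for x in range(width)] for y in range(height)]
    bounds = [height * k // num_strips for k in range(num_strips + 1)]
    with ThreadPoolExecutor(max_workers=num_strips) as pool:
        futures = [pool.submit(label_strip, image, labels, bounds[k], bounds[k + 1])
                   for k in range(num_strips)]
        for future in futures:
            future.result()  # re-raise any worker exception
    # Merge regions that continue across a strip border (sequential here).
    parent = list(range(height * width))
    for y in bounds[1:-1]:
        for x in range(width):
            if image[y - 1][x] == image[y][x]:
                a = find(parent, labels[y - 1][x])
                b = find(parent, labels[y][x])
                if a != b:
                    parent[max(a, b)] = min(a, b)  # smaller label becomes root
    return [[find(parent, label) for label in row] for row in labels]
```

Because the smaller label always becomes the root, the merged result matches what a single-threaded labeling of the whole image would produce under the same assumptions.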
12. Connected Comp. Labeling - Vectorization
Process multiple data simultaneously using SIMD instructions
Example: Self-labeling within the sub-image of thread 2
1. Initialize the labeling (array index).
2. Load row[0] into reg0, and create a mask for adjacent entries in the positive 1-direction: 1 if equal-colored, 0 otherwise.
3. Overlap each element in reg0 with its adjacent element in the positive 1-direction, and write the result to reg1.
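Steps 1 to 3 can be emulated lane by lane with plain Python lists standing in for the 16-lane registers. This is only a sketch, not the intrinsics code from the talk. It assumes that "overlap" means moving each neighbor's label into the lane of its left-hand partner, and that the last lane has no neighbor inside the register. The helper names are hypothetical.

```python
LANES = 16  # 512-bit register holding 32-bit words


def init_labels(width, height):
    """Step 1: every pixel starts with its flattened array index as label."""
    return [[y * width + x for x in range(width)] for y in range(height)]


def neighbor_mask(colors):
    """Step 2: mask[i] is 1 if lanes i and i+1 are equal-colored, else 0.

    The last lane has no neighbor inside the register, so its bit is 0.
    """
    n = len(colors)
    return [int(i + 1 < n and colors[i] == colors[i + 1]) for i in range(n)]


def shift_down(reg0):
    """Step 3: reg1[i] = reg0[i + 1]; the last lane keeps its own value."""
    return reg0[1:] + reg0[-1:]
```

With reg0 holding the labels 0 to 15, shift_down places label i + 1 in lane i, so reg0 and reg1 can be compared lane-wise in the next step.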
17. Connected Comp. Labeling - Vectorization
4. Determine the pairwise minimum of the entries in reg0 and reg1 using the mask, and write the result to reg1.
5. Write the entries in reg1 back to row[0] using the mask.
6. Shift all elements in reg1 one position in the positive 1-direction, shifting in the 0-th element, and write the result to reg1.
7. Shift all bits in the mask one position up, and write the pairwise minimum of the entries in row[0] and reg1 to row[0] using the shifted mask.
8. Did labels change?
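The complete register-level pass over one row (steps 2 to 8) can be emulated with Python lists as lanes. This is a hedged sketch of how the steps fit together, not the masked-intrinsics implementation itself. It assumes one row fits into a single register and that unmasked lanes keep their previous contents. The name row_pass is hypothetical.

```python
def row_pass(colors, row):
    """Apply steps 2 to 8 to one register-wide row of labels.

    Returns (new_row, changed).
    """
    n = len(row)
    reg0 = list(row)  # step 2: load the labels
    mask = [int(i + 1 < n and colors[i] == colors[i + 1]) for i in range(n)]
    reg1 = reg0[1:] + reg0[-1:]  # step 3: neighbor labels, lane-aligned
    # Step 4: masked pairwise minimum; unmasked lanes keep reg1's value.
    reg1 = [min(a, b) if m else b for a, b, m in zip(reg0, reg1, mask)]
    # Step 5: masked write-back to the row.
    new_row = [b if m else r for r, b, m in zip(row, reg1, mask)]
    # Step 6: shift reg1 one lane up, re-inserting the 0-th element.
    reg1 = reg1[:1] + reg1[:-1]
    # Step 7: shift the mask one bit up, then masked minimum into the row.
    mask_up = [0] + mask[:-1]
    new_row = [min(r, b) if m else r for r, b, m in zip(new_row, reg1, mask_up)]
    # Step 8: report whether any label changed.
    return new_row, new_row != list(row)
```

One pass only propagates a label by one lane per connected pair, so the pass is repeated until it reports no change.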
18. Connected Comp. Labeling - Vectorization
Result of the operations up to now: adjacent connected elements in row[0] are each set to their pairwise minimum value.
[Figure: row[0] before and after the update; axes: 1-direction, 2-direction]
Repeat the procedure for the 2-direction.
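For the 2-direction, one possible lane-wise formulation is sketched below. It assumes that the neighbor of each element sits in the same lane of the next row, so no shift is needed. This is an interpretation of "repeat the procedure", not something the slides spell out, and the name column_pass is hypothetical.

```python
def column_pass(colors_a, colors_b, row_a, row_b):
    """Set vertically connected lanes of two adjacent rows to their minimum.

    Returns (new_row_a, new_row_b, changed).
    """
    # Mask: 1 where the two rows are equal-colored in the same lane.
    mask = [int(ca == cb) for ca, cb in zip(colors_a, colors_b)]
    low = [min(a, b) for a, b in zip(row_a, row_b)]
    # Masked write-back to both rows; unmasked lanes stay untouched.
    new_a = [l if m else a for a, l, m in zip(row_a, low, mask)]
    new_b = [l if m else b for b, l, m in zip(row_b, low, mask)]
    changed = new_a != list(row_a) or new_b != list(row_b)
    return new_a, new_b, changed
```

Alternating such a pass with the 1-direction pass until neither reports a change corresponds to the self-labeling loop within one sub-image.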
19. Connected Comp. Labeling - Vectorization
Repeat the procedure for all other rows as long as labels change . . .
[Figure: sub-image labels before and after]
Now: Merge labels across different sub-images using atomics!
Finished!
20. Connected Comp. Labeling - Benchmark
CPU: Xeon E5-2670, 8 cores + 2-way Hyper-Threading @ 2.6 GHz
  Hoshen-Kopelman algorithm + atomics for label merging
  Vectorization was left to the compiler: there are no masked SIMD intrinsics!
Xeon Phi: 60 cores + 4-way Hyper-Threading @ 1.1 GHz
  Hoshen-Kopelman vs. cluster self-labeling + atomics for label merging
  Vectorization by means of _mm512_[mask]_XXX() intrinsics
Parallelization by means of OpenMP: #pragma omp parallel {...}
Programming effort: approx. 2-3 days for the CPU code (incl. optimization), less than 1 day for the Xeon Phi code (based on the CPU code)
23. Acknowledgement
Work partially funded by BMBF Grant No. 01IH11004G
Dr. Thomas Steinke, Zuse Institute Berlin (ZIB)
Dr. Michael Klemm, Intel GmbH, Germany
24. References
[1] C. F. Baillie and P. D. Coddington. Cluster Identification Algorithms for Spin Models – Sequential and Parallel, 1991.
[2] J. Hoshen and R. Kopelman. Percolation and Cluster Distribution. I. Cluster Multiple Labeling Technique and Critical Concentration Algorithm. Phys. Rev. B, 14:3438–3445, 1976.
[3] R. H. Swendsen and J.-S. Wang. Nonuniversal Critical Dynamics in Monte Carlo Simulations. Phys. Rev. Lett., 58:86–88, Jan 1987.
[4] Intel Corp. Intel Xeon Phi Coprocessor 5110P, Product Brief, 2012.