An attempt to quantify the substantial performance improvement observed on Windows 8.1 with an NVIDIA GTX 780M and Intel HD 4600 using the latest NVIDIA driver (326.01), which may help other users, particularly of the MATLAB Image Processing and Parallel Computing Toolboxes, to consider upgrading...
Hybrid CPU GPU MATLAB Image Processing Benchmarking
1. The benefits of upgrading to Haswell Architecture and Windows 8.1:
Benchmarking of Hybrid (CPU-GPU) Parallel Processing (CUDA)-enabled MATLAB Image Processing Algorithms on GTX TITAN and GTX 780M
DIMITRIS VAYENAS, POSTGRADUATE STUDENT
DEPARTMENT OF COMPUTER SCIENCE @ THE UNIVERSITY OF OXFORD &
SOFTWARE INCUBATOR AT ISIS INNOVATION LTD.
2. Contents
Introduction
A “Real-Life” Hybrid (CPU-GPU) Algorithm
Hardware and Software of Testing
Performance
Comparison
Conclusion
Acknowledgements
3. Introduction
In this laboratory we attempt to answer the following question:
Is it worth upgrading from an Ivy Bridge to a Haswell architecture in order to improve performance?
Intel claims that the new HD 4600 integrated graphics core in its 4th-generation i7 processors can increase performance over the previous architecture by up to 7 times.
What kind of performance improvements can we look forward to in “real-life examples”, and under what conditions?
4. A “Real-Life” Hybrid Algorithm (1/2)
Hybrid: Executes in both CPU and GPU
Consider a MATLAB implemented algorithm containing the following steps:
5. A “Real-Life” Hybrid Algorithm (2/2)
In the hybrid algorithm, the tasks in black are performed on the GPU while the tasks in red are performed on the CPU.
Thus, we have the usual overhead of transferring the data to and from the GPU, and the performance of the CPU also plays a significant role; this consideration is usually ignored by graphics performance benchmarks, most of which test either the GPU or the CPU, but not both.
Ideally we would have liked to run all tasks on the GPU; however, the current version of MATLAB does not yet support these functions on the GPU.
As we will see, the NVIDIA drivers have a substantial impact on performance.
6. Hardware and Software of Testing
System I:
SCAN Workstation with NVIDIA GTX TITAN, Intel i7 3770K @ 4.5 GHz, 32 GB RAM @ 2133 MHz, SSD with over 500 MB/s read and write
OS: Windows Server 2012 Datacentre Edition
NVIDIA Driver: 320.49
System II:
Schenker W503 with NVIDIA GTX 780M, Intel i7 4800 @ 3.5 GHz, 16 GB RAM @ 1600 MHz, SSD with over 500 MB/s read and write
A) OS: Windows Server 2012 Datacentre Edition
NVIDIA Driver: 320.49
B) OS: Windows 8.1
NVIDIA Driver: 326.01
(Important Notice: Figures for System I on Windows 8.1 will be added here by
Wednesday 3/7/2013)
7. Performance (total runtimes)
Total runtimes in seconds (less is better). Each task is labelled with the number of runs per test and where it executes (CPU or GPU).

Task                     System I                  System II (a)            System II (b)
                         (TITAN on WinSrv 2012)    (780M on WinSrv 2012)    (780M on Win 8.1)
Edge (800/CPU)           1720.265                  1661.289                 1261.870
Regionprops (400/CPU)     956.622                   899.934                  646.883
Imfilter (1600/GPU)       339.045                   339.477                  263.572
Imresize (1200/CPU)       338.574                   295.782                  199.593
Padarray (2000/CPU)       204.734                   196.303                  149.067
Imfilter (1600/GPU)       126.362                   131.112                  101.717
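
The totals above can be turned into an approximate time per call by dividing each total by its run count. The sketch below (in Python, purely for illustration) assumes the number in parentheses is the number of runs that produced each total; the name `mean_time_per_run` is hypothetical and not part of the original benchmark.

```python
# Total runtimes in seconds, copied from the table above.
# Each entry: (task, runs, System I, System II (a), System II (b)).
# Assumption: "runs" is the number of calls that produced each total.
TOTALS = [
    ("Edge", 800, 1720.265, 1661.289, 1261.870),
    ("Regionprops", 400, 956.622, 899.934, 646.883),
    ("Imfilter", 1600, 339.045, 339.477, 263.572),
    ("Imresize", 1200, 338.574, 295.782, 199.593),
    ("Padarray", 2000, 204.734, 196.303, 149.067),
    ("Imfilter", 1600, 126.362, 131.112, 101.717),
]


def mean_time_per_run(total_seconds, runs):
    """Average seconds per run, given a total runtime and a run count."""
    return total_seconds / runs


if __name__ == "__main__":
    for task, runs, sys1, sys2a, sys2b in TOTALS:
        per_run = [mean_time_per_run(t, runs) for t in (sys1, sys2a, sys2b)]
        print(f"{task:<12} " + "  ".join(f"{p:8.4f} s" for p in per_run))
```

Per-run averages make tasks with different run counts easier to compare, although they hide any variation between individual runs.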
8. Performance (total run times)
[Bar chart: task time totals in seconds (less is better) for Edge (800), Regionprops (400), Imfilter (1600), Imresize (1200) and Padarray (2000); series: System I, System II (a), System II (b).]
9. Performance (Indicative times to process an image)
Parameters: Magnification (Mag), Fudge Factor (FF), Sigma (S) and HSize (HS)
Results in seconds.

Image Processing            System I    System II (a)    System II (b)
Mag_1_FF_0.2_S_0.2_HS_1      0.39699     0.49159          0.18465
Mag_1_FF_0.6_S_0.6_HS_61     0.46689     0.62617          0.38815
Mag_1_FF_1_S_0.8_HS_1       11.4042      8.1427           0.49579
Mag_3_FF_0.4_S_0.8_HS_41     3.1976      2.8881           1.4568
Mag_5_FF_0.4_S_0.8_HS_41     5.7096      4.4588           3.9456
Mag_7_FF_0.4_S_0.8_HS_41     9.1622     10.6905           8.4348
Mag_9_FF_0.4_S_0.8_HS_41    14.5562     17.9971          14.8889
Mag_9_FF_1_S_0.8_HS_41      28.8458     17.0872          15.5799
10. Performance (Indicative times to process an image)
Parameters: Magnification, Fudge Factor, Sigma and HSize
[Chart: execution time in seconds to process specific images, one group per parameter set; series: System I, System II (a), System II (b).]
11. Performance Comparison (total run times)
Percentage change in total runtime. Each task is labelled with the number of runs per test and where it executes (CPU or GPU).

Task                     System II (a)    System II (b)        System II (b)
                         vs. System I     vs. System II (a)    vs. System I
Edge (800/CPU)             3.4             24.0                 26.6
Regionprops (400/CPU)      5.9             28.1                 32.4
Imfilter (1600/GPU)       -0.1             22.4                 22.3
Imresize (1200/CPU)       12.6             32.5                 41.0
Padarray (2000/CPU)        4.1             24.1                 27.2
Imfilter (1600/GPU)       -3.8             22.4                 19.5
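
The slides do not state how the percentage change is computed. The figures are consistent with the reduction in runtime relative to the baseline system, (baseline - new) / baseline × 100, so a positive value means the newer configuration is faster. A minimal sketch under that assumption (in Python, for illustration only; `percentage_change` is a hypothetical helper name):

```python
def percentage_change(baseline_seconds, new_seconds):
    """Runtime reduction relative to the baseline, in percent.

    Positive values mean the new configuration is faster.
    Assumed formula; it reproduces the values in the table above.
    """
    return (baseline_seconds - new_seconds) / baseline_seconds * 100.0


if __name__ == "__main__":
    # Edge (800/CPU) totals in seconds, taken from the total runtimes table.
    system_1, system_2a, system_2b = 1720.265, 1661.289, 1261.870
    print(round(percentage_change(system_1, system_2a), 1))   # 3.4
    print(round(percentage_change(system_2a, system_2b), 1))  # 24.0
    print(round(percentage_change(system_1, system_2b), 1))   # 26.6
```

Note that under this definition a value such as 93.9% means the runtime fell to about 6% of the baseline, not that throughput rose by 93.9%.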
12. Performance Comparison (total run times)
[Bar chart: percentage change in total runtime per task; series: System II (a) vs. System I, System II (b) vs. System II (a), System II (b) vs. System I.]
13. Performance Comparison based on the time to process image
Parameters: Magnification, Fudge Factor, Sigma and HSize
Percentage change in the time to process an image.

Image Processing            System II (a)    System II (b)        System II (b)
                            vs. System I     vs. System II (a)    vs. System I
Mag_1_FF_0.2_S_0.2_HS_1     -23.8             62.4                 53.5
Mag_1_FF_0.6_S_0.6_HS_61    -34.1             38.0                 16.9
Mag_1_FF_1_S_0.8_HS_1        28.6             93.9                 95.7
Mag_3_FF_0.4_S_0.8_HS_41      9.7             49.6                 54.4
Mag_5_FF_0.4_S_0.8_HS_41     21.9             11.5                 30.9
Mag_7_FF_0.4_S_0.8_HS_41    -16.7             21.1                  7.9
Mag_9_FF_0.4_S_0.8_HS_41    -23.6             17.3                 -2.3
Mag_9_FF_1_S_0.8_HS_41       40.8              8.8                 46.0
14. Performance Comparison based on the time to process image
Parameters: Magnification, Fudge Factor, Sigma and HSize
[Chart: percentage change in image processing time, one group per parameter set; series: System II (a) vs. System I, System II (b) vs. System II (a), System II (b) vs. System I.]
15. Conclusion
The performance improvements due to the new architecture in Intel’s fourth-generation i7 family are substantial: we see clear gains on CPU-related tasks for the i7 4800 mobile CPU over the overclocked i7 3770K!
NVIDIA also seems to offer improved support for its GTX 7*** series on Windows 8.1, where we have seen an improvement of up to 93.9% for one set of parameters and of over 20% overall on identical hardware running Windows 8.1 with the 326.01 driver vs. Windows Server 2012 with the 320.49 driver.
Obviously, measuring the performance of hybrid algorithms is similar to asking “how long is a piece of string”, but given that manufacturers fine-tune their products to perform better in standard benchmarking tools, it is always wise to create your own benchmarks that fit your applications.
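
As a rough illustration of the advice above, an application-specific benchmark can be as simple as calling a task a fixed number of times and recording the total wall-clock time, in the spirit of the "runs per test" totals reported earlier. The sketch below is in Python rather than MATLAB purely for illustration, and is not the harness used for these slides; `time_task` and `example_task` are hypothetical names.

```python
import time


def time_task(task, runs):
    """Call task() `runs` times and return the total wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(runs):
        task()
    return time.perf_counter() - start


def example_task():
    # Hypothetical stand-in workload; replace with a step from your own
    # application so the benchmark reflects what you actually run.
    sum(i * i for i in range(10_000))


if __name__ == "__main__":
    runs = 100
    total = time_task(example_task, runs)
    print(f"{runs} runs took {total:.3f} s in total")
```

For GPU work, the timed task should also include the transfers to and from the device, since that overhead is part of what a hybrid algorithm pays in practice.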
16. Acknowledgements
I would like to thank the following individuals for their help in measuring and
optimising the performance of my MATLAB code, through their extensive
knowledge of MATLAB and/or CUDA:
Dr. Mike Giles, Professor of Scientific Computing at the University of Oxford; resident
expert for NVIDIA and MATLAB
Dr. James Lebak, Parallel Computing Software Engineer at MathWorks, Boston HQ.
Captain (USMC) John Roberts, Senior Principal GPGPU Software Engineer at BAE
Systems, Inc. (formerly of NVIDIA); John also heads the CUDA Vision Workbench
project.
I would also like to thank XMG-Schenker for supporting my research effort
through their generous sponsorship of my Schenker W503