Presentation I gave at the SORT Conference in 2011. It was generalized from some work I had done using GPUs to accelerate image processing at FamilySearch.
2. Moore’s Law
"The number of transistors incorporated in a chip
will approximately double every 24 months."
Gordon Moore, Intel Co-Founder
Originally published in 1965
3. [image slide]
4. So What’s the Problem?
• Can continue to increase transistors per Moore’s Law
• Cannot continue to increase power or chips will melt
– Power steadily rose with new chips until ~2005; supply voltage is now stuck near 1 volt
• Cannot continue to scale processor frequency
– Have you seen any 10 GHz chips?
Moore’s Law gave no prediction of
continued performance increases
5. Time to “Take the Leap”
“We have reached the limit of what is possible with
one or more traditional, serial central processing
units, or CPUs. It is past time for the computing
industry – and everyone who relies on it for
continued improvements in productivity, economic
growth and social progress – to take the leap into
parallel processing.”
Bill Dally - Chief Scientist at NVIDIA and Professor at Stanford University
http://www.forbes.com/2010/04/29/moores-law-computing-processing-opinions-contributors-bill-dally.html
6. Additional Resources
• Stanford course available on iTunes U
• http://itunes.apple.com/us/itunes-u/programming-massively-parallel/id384233322
– Programming Massively Parallel Processors with
CUDA
– Lectures 1 and 13 are great introductions
• Lecture 13 – The Future of Throughput Computing (Bill Dally)
• Lecture 1 – Introduction to Massively Parallel Computing
7. Guiding Principles
• Performance = Parallelism
– Single-threaded processor performance has flat-lined at 0-5% annual growth since ~2005
• Efficiency = Locality
– Chips are power limited with most power spent
moving data around
8. Three Types of Parallelism
• Instruction-level parallelism
– Out of order execution, branch prediction, etc.
– Opportunities decreasing
• Data-level parallelism
– SIMD (Single Instruction Multiple Data), GPUs, etc.
– Opportunities increasing
• Thread-level parallelism
– Multithreading, multi-core CPUs, etc.
– Opportunities increasing
9. Taking the Leap
• Three things are required
– Lots of processors
– Efficient memory storage
– Programming system that abstracts it
10. CPU VS. GPU ARCHITECTURE
CPU
• General purpose processors
• Optimized for instruction-level parallelism
• A few large processors capable of multi-threading
GPU
• Special purpose processors
• Optimized for data-level parallelism
• Many smaller processors executing single instructions on multiple data (SIMD)
11. High Performance GPU Computing
• GPU performance is growing more quickly than CPU performance
• Being used in industry for weather simulation,
medical imaging, computational finance, etc.
• Amazon is now offering access to NVIDIA Tesla
GPUs in the cloud as a service ($ vs ¢ per hour)
• GPUs are being used as general purpose parallel
processors – http://gpgpu.org
12. Examples
• CUDA – NVIDIA
• C++ AMP – Microsoft
• OpenCL – Open source
• NPP – NVIDIA (Research done at FamilySearch)
13. CUDA
• Compute Unified Device Architecture
• Proprietary NVIDIA extensions to C for
running code on NVIDIA GPUs
• Other language bindings
– Java – jCUDA, JCuda, JCublas, JCufft
– Python – PyCUDA, KappaCUDA
– .NET – CUDAfy.NET, CUDA.NET
– Ruby – KappaCUDA
– More – Fortran, Perl, Mathematica, MATLAB, etc.
14. C for CUDA Example
// Compute vector sum c = a + b
// Each thread performs one pair-wise addition
__global__ void vector_add(float* A, float* B, float* C)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
// Allocate and initialize host (CPU) memory
float *hostA = …, *hostB = …, *hostC = …;
// Allocate device (GPU) memory
float *deviceA, *deviceB, *deviceC;
cudaMalloc((void**) &deviceA, N * sizeof(float));
cudaMalloc((void**) &deviceB, N * sizeof(float));
cudaMalloc((void**) &deviceC, N * sizeof(float));
// Copy host memory to device
cudaMemcpy(deviceA, hostA, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(deviceB, hostB, N * sizeof(float), cudaMemcpyHostToDevice);
// Run N/256 blocks of 256 threads each
vector_add<<<N/256, 256>>>(deviceA, deviceB, deviceC);
// Copy the result back to the host and free device memory
cudaMemcpy(hostC, deviceC, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(deviceA); cudaFree(deviceB); cudaFree(deviceC);
}
15. Heterogeneous Computing with
Microsoft C++ AMP
• AMP = Accelerated Massive Parallelism
• Designed to take advantage of all the available compute
resources (CPU, integrated & discrete GPUs)
• Coming in the next version of Visual Studio and C++ in
the next year or two
• Cool demo
http://hothardware.com/News/Microsoft-Demos-C-AMP-Heterogeneous-Computing-at-AFDS/
16. EXAMPLE – C++ AMP
void MatrixMult(float* C, const vector<float>&A, const vector<float>&B, int M, int N, int W)
{
for (int y = 0; y < M; y++) {
for (int x = 0; x < N; x++) {
float sum = 0;
for (int i = 0; i < W; i++)
sum += A[y*W + i] * B[i*N + x];
C[y*N + x] = sum;
}
}
}
void MatrixMult(float* C, const vector<float>&A, const vector<float>&B, int M, int N, int W)
{
array_view<const float, 2> a (M, W, A), b(W, N, B);
array_view<writeonly<float>, 2> c(M, N, C);
parallel_for_each(c.grid, [=](index<2> idx) restrict(direct3d) {
float sum = 0;
for (int i = 0; i < a.x; i++)
sum += a(idx.y, i) * b(i, idx.x);
c[idx] = sum;
});
}
17. OpenCL
• Royalty free, cross-platform, vendor neutral
• Managed by Khronos OpenCL working group
(www.khronos.org/opencl)
• Design goal to use all computational resources
– GPUs and CPUs are peers
• Based on C
• Abstract the specifics of underlying hardware
18. Example – OpenCL
void trad_mul(int n, const float *a, const float* b, float* c)
{
for (int i = 0; i < n; i++)
c[i] = a[i] * b[i];
}
kernel void dp_mul(global const float *a, global const float* b, global float* c)
{
int id = get_global_id(0);
c[id] = a[id] * b[id];
} // Execute over “n” work-items
19. Image Processing Flow at FamilySearch
• Image Capture (Uncompressed TIFF) – microfilm scanners, digital cameras
• Image Post-Processing (DPC)
• Preservation Storage (Lossless JPEG-2000)
• Distribution Storage (JPEG – original size, JPEG – thumbnails)
20. Digital Processing Center (DPC)
• Collection of servers in a data center used by FamilySearch
to continuously process millions of images annually
• Image post processing operations performed include
– Automatic skew correction
– Automatic document cropping
– Image sharpening
– Image scaling (thumbnail creation)
– Encoding into other image formats
• CPU is a current bottleneck (~12 sec/image)
• Processing requirements continuously rising (number of
images, image size and number of color channels)
21. Computer Graphics vs.
Computer Vision
• Approximate inverses of each other:
– Computer graphics – converting “numbers into pictures”
– Computer vision – converting “pictures into numbers”
• GPUs have traditionally been used for computer
graphics – (Ex. Graphics intensive computer games)
• Recent research, hardware and software are using
GPUs for computer vision (Ex. Using Graphics
Devices in Reverse)
• GPUs generally work well when there is ample data-level parallelism
22. IMPLEMENTATION OPTIONS
Rack Mount Servers
• Several vendors provide solutions. (Ex. one is a 3U rack mount unit capable of holding 16 GPUs connected to 8 servers)
• “Compared to typical quad-core CPUs, Tesla 20 series computing systems deliver equivalent performance at 1/10th the cost and 1/20th the power consumption.” (NVIDIA)
Personal Supercomputer
• GPUs for computing can be placed in a standard workstation. Several vendors provide solutions.
• Each Tesla GPU requires
– Available double-wide PCIe slot
– Two 6-pin or one 8-pin PCIe power connectors and sufficient wattage
– Recommend 4GB RAM per card, at least 2.33 GHz quad-core CPU and 64-bit Linux or Windows
• “250x the computing performance of a standard workstation” (NVIDIA)
23. Image Processing Performance
with IPP and NPP
• FamilySearch currently uses Intel’s IPP
– Intel Performance Primitives
– Optimize operations on Intel CPUs
– Closed source, licensed
• NVIDIA has produced a similar library called NPP
– NVIDIA Performance Primitives
– Optimize operations on NVIDIA GPUs (CUDA underneath)
– Higher level abstraction to perform image processing on GPUs
– No license for SDK
24. EXAMPLE – NPP
// Declare a host object for an 8-bit grayscale image
npp::ImageCPU_8u_C1 hostSrc;
// Load grayscale image from disk
npp::loadImage(sFilename, hostSrc);
// Declare a device image and upload from host
npp::ImageNPP_8u_C1 deviceSrc(hostSrc);
… [Create padded image]
… [Create Gaussian kernel]
// Copy kernel to GPU
cudaMemcpy2D(deviceKernel, 12, hostKernel, kernelSize.width * sizeof(Npp32s), kernelSize.width * sizeof(Npp32s), kernelSize.height, cudaMemcpyHostToDevice);
IPP (CPU):
// Allocate blurred image of appropriate size
Ipp8u* blurredImg = ippiMalloc_8u_C1(img.getWidth(), img.getHeight(), &blurredImgStepSz);
// Perform the filter
ippiFilter32f_8u_C1R(paddedImgData, paddedImage.getStepSize(), blurredImg, blurredImgStepSz, imgSz, kernel, kernelSize, kernelAnchor);
NPP (GPU):
// Allocate blurred image of appropriate size (on GPU)
npp::ImageNPP_8u_C1 deviceBlurredImg(imgSz.width, imgSz.height);
// Perform the filter
nppiFilter_8u_C1R(paddedImg.data(widthOffset, heightOffset), paddedImg.pitch(), deviceBlurredImg.data(), deviceBlurredImg.pitch(), imgSz, deviceKernel, kernelSize, kernelAnchor, divisor);
// Declare a host image for the result
npp::ImageCPU_8u_C1 hostBlurredImg(deviceBlurredImg.size());
// Copy the device result data into it
deviceBlurredImg.copyTo(hostBlurredImg.data(), hostBlurredImg.pitch());
25. Performance Testing Methodology
• Test System Specifications
– Dual Quad Core Intel® Xeon® 2.80GHz i7 CPUs (8 cores
total)
– 6 GB RAM
– 64-bit Windows 7 operating system
– Single Tesla C1060 Compute Processor (240 processing cores
total)
– PCI-Express x16 Gen2 slot
• Three representative grayscale images of increasing size
– Small image – 1726 x 1450 (2.5 megapixels)
– Average image – 4808 x 3940 (18.9 megapixels)
– Large image – 8966 x 6132 (55.0 megapixels)
• Results for each image repeated 3 times and averaged
• Transfer time to/from the GPU is considered part of all
GPU operations
27. AMDAHL’S LAW
Speeding up 25% of an
overall process by 10x is
less of an overall
improvement than
speeding up 75% of an
overall process by 1.5x
28. Takeaways
• Significant performance increases can be realized through
parallelization – it may become the only way in the future
• GPUs are transforming into general purpose data-parallel
computational coprocessors and outstripping advances in multi-core CPUs
• Languages, tools and APIs for parallel computing remain relatively
immature, but are improving rapidly
• Relatively small learning curve
– For image processing, NPP’s API nearly perfectly matches Intel’s IPP
– New paradigms around copying to/from GPU and allocating memory
– Can use programming languages familiar to developers without
understanding intricacies of GPU architectures
– Does require rethinking of algorithms to be parallel and building the
computation around the data
Editor's notes
Don’t claim to be expert
Source of much of what I will present – gives a lot more details, coming from people who know a lot more than I do
Even CPUs realize performance is about parallelism – multi-core CPUs. Power required increases exponentially with distance – Bill Dally says that lots of arithmetic units are actually not hot
GPUs initially only for computer graphics acceleration
Of course want something that is open
Number of images increasing as is size, more color, etc.
Data center servers for large scale places like FamilySearch; workstations could be put in smaller installations such as an archive. Based on limited survey (most sites don’t list prices): ~$5-6K list price for 1U server or personal supercomputer w/2 Teslas; ~$8-9K list price for 1U server or personal supercomputer w/4 Teslas; ~$1200 per Tesla
NVIDIA directly going at IPP. Imaging library structured so that we could create an implementation for GPUs to run on a single GPU-based server concurrent with the current system
Rotating, cropping, sharpening and scaling operations parallelized on GPU