The Rise of Parallel Computing
Ben Baker
Moore’s Law

"The number of transistors incorporated in a chip
will approximately double every 24 months."

             Gordon Moore, Intel Co-Founder
                 Originally published in 1965
So What’s the Problem?

• Can continue to increase transistors per Moore’s Law
• Cannot continue to increase power or chips will melt
   – Power rose steadily with each new chip generation until ~2005; supply voltage has since bottomed out near 1 volt
• Cannot continue to scale processor frequency
   – Have you seen any 10 GHz chips?


            Moore’s Law gave no prediction of
            continued performance increases
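
As background (not stated on the slide), the standard first-order model of dynamic CMOS power makes the frequency wall concrete:

   \( P_{\mathrm{dynamic}} \approx \alpha \, C \, V^{2} f \)

where α is the switching activity, C the switched capacitance, V the supply voltage and f the clock frequency. With V stuck near 1 volt, raising f increases power almost proportionally, so the power budget rather than the transistor budget now caps single-core performance.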
Time to “Take the Leap”
“We have reached the limit of what is possible with
one or more traditional, serial central processing
units, or CPUs. It is past time for the computing
industry – and everyone who relies on it for
continued improvements in productivity, economic
growth and social progress – to take the leap into
parallel processing.”

        Bill Dally - Chief Scientist at NVIDIA and Professor at Stanford University
http://www.forbes.com/2010/04/29/moores-law-computing-processing-opinions-contributors-bill-dally.html
Additional Resources

• Stanford course available on iTunes U
•   http://itunes.apple.com/us/itunes-u/programming-massively-parallel/id384233322

     – Programming Massively Parallel Processors with
       CUDA
     – Lectures 1 and 13 are great introductions
          • Lecture 13 – The Future of Throughput Computing (Bill Dally)
          • Lecture 1 – Introduction to Massively Parallel Computing
Guiding Principles
• Performance = Parallelism
  – Single-threaded processor performance has flat-lined at 0-5% annual growth since ~2005
• Efficiency = Locality
  – Chips are power limited with most power spent
    moving data around
Three Types of Parallelism
• Instruction-level parallelism
  – Out of order execution, branch prediction, etc.
  – Opportunities decreasing
• Data-level parallelism
  – SIMD (Single Instruction Multiple Data), GPUs, etc.
  – Opportunities increasing
• Thread-level parallelism
  – Multithreading, multi-core CPUs, etc.
  – Opportunities increasing
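
To make the three categories concrete, here is an illustrative sketch (my addition, not from the slides): the same vector add written three ways, assuming an x86 compiler with SSE support and OpenMP enabled (-fopenmp).

#include <immintrin.h>   // SSE intrinsics for the data-parallel version

// Serial version: one element per iteration; any parallelism here is
// instruction-level, extracted by the hardware (out-of-order execution, etc.)
void add_serial(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// Data-level parallelism: one SSE instruction adds 4 floats at a time
void add_simd(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)        // scalar tail for the last n % 4 elements
        c[i] = a[i] + b[i];
}

// Thread-level parallelism: the loop iterations are split across CPU cores
void add_threads(const float *a, const float *b, float *c, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}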
Taking the Leap

• Three things are required
  – Lots of processors
  – Efficient memory storage
  – Programming system that abstracts it
CPU VS. GPU ARCHITECTURE

CPU
•   General purpose processors
•   Optimized for instruction-level parallelism
•   A few large processors capable of multi-threading

GPU
•   Special purpose processors
•   Optimized for data-level parallelism
•   Many smaller processors executing single instructions on multiple data (SIMD)
High Performance GPU Computing
• GPU performance is improving more quickly than CPU performance
• Being used in industry for weather simulation,
  medical imaging, computational finance, etc.
• Amazon is now offering access to NVIDIA Tesla
  GPUs in the cloud as a service ($ vs ¢ per hour)
• GPUs are being used as general purpose parallel
  processors – http://gpgpu.org
Examples

•   CUDA – NVIDIA
•   C++ AMP – Microsoft
•   OpenCL – Open standard (Khronos Group)
•   NPP – NVIDIA (Research done at FamilySearch)
CUDA
• Compute Unified Device Architecture
• Proprietary NVIDIA extensions to C for
  running code on NVIDIA GPUs
• Other language bindings
  – Java – jCUDA, JCuda, JCublas, JCufft
  – Python – PyCUDA, KappaCUDA
  – .NET – CUDAfy.NET, CUDA.NET
  – Ruby – KappaCUDA
  – More – Fortran, Perl, Mathematica, MATLAB, etc.
C for CUDA Example
// Compute vector sum c = a + b
// Each thread performs one pair-wise addition
__global__ void vector_add(float* A, float* B, float* C)
{
   int i = threadIdx.x + blockDim.x * blockIdx.x;
   C[i] = A[i] + B[i];
}


int main()
{
   // Allocate and initialize host (CPU) memory
   float* hostA = …, *hostB = …;

    // Allocate device (GPU) memory
    float *deviceA, *deviceB, *deviceC;
    cudaMalloc((void**) &deviceA, N * sizeof(float));
    cudaMalloc((void**) &deviceB, N * sizeof(float));
    cudaMalloc((void**) &deviceC, N * sizeof(float));

    // Copy host memory to device
    cudaMemcpy(deviceA, hostA, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, N * sizeof(float), cudaMemcpyHostToDevice);

    // Run N/256 blocks of 256 threads each
    vector_add<<< N/256, 256>>>(deviceA, deviceB, deviceC);
}
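
Two details the slide leaves out, sketched here as my own addition rather than part of the original example: the launch above assumes N is an exact multiple of 256, and the result is never copied back. A common variant rounds the block count up, guards against out-of-range threads, and completes the round trip (hostC is a hypothetical host buffer of N floats, allocated like hostA and hostB):

// Kernel with a bounds check so the last, partially filled block is safe
__global__ void vector_add(float* A, float* B, float* C, int n)
{
   int i = threadIdx.x + blockDim.x * blockIdx.x;
   if (i < n)
      C[i] = A[i] + B[i];
}

// Round the block count up so all N elements are covered
vector_add<<< (N + 255) / 256, 256 >>>(deviceA, deviceB, deviceC, N);

// Copy the result back to the host and release device memory
cudaMemcpy(hostC, deviceC, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(deviceA);
cudaFree(deviceB);
cudaFree(deviceC);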
Heterogeneous Computing with
        Microsoft C++ AMP
• AMP = Accelerated Massive Parallelism
• Designed to take advantage of all the available compute
  resources (CPU, integrated & discrete GPUs)
• Coming in the next version of Visual Studio and C++ in
  the next year or two
• Cool demo
   http://hothardware.com/News/Microsoft-Demos-C-AMP-Heterogeneous-Computing-at-AFDS/
EXAMPLE – C++ AMP
void MatrixMult(float* C, const vector<float>&A, const vector<float>&B, int M, int N, int W)
{
   for (int y = 0; y < M; y++) {
      for (int x = 0; x < N; x++) {
         float sum = 0;
            for (int i = 0; i < W; i++)
               sum += A[y*W + i] * B[i*N + x];
            C[y*N + x] = sum;
      }
   }
}



void MatrixMult(float* C, const vector<float>&A, const vector<float>&B, int M, int N, int W)
{
   array_view<const float, 2> a (M, W, A), b(W, N, B);
   array_view<writeonly<float>, 2> c(M, N, C);

    parallel_for_each(c.grid, [=](index<2> idx) restrict(direct3d) {
        float sum = 0;
        for (int i = 0; i < a.x; i++)
           sum += a(idx.y, i) * b(i, idx.x);
        c[idx] = sum;
    });
}
OpenCL

• Royalty free, cross-platform, vendor neutral
• Managed by Khronos OpenCL working group
  (www.khronos.org/opencl)
• Design goal to use all computational resources
  – GPUs and CPUs are peers
• Based on C
• Abstract the specifics of underlying hardware
Example – OpenCL
void trad_mul(int n, const float *a, const float* b, float* c)
{
  for (int i = 0; i < n; i++)
    c[i] = a[i] * b[i];
}



 kernel void dp_mul(global const float *a, global const float* b, global float* c)
{
  int id = get_global_id(0);
  c[id] = a[id] * b[id];
} // Execute over "n" work-items
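
The slide shows only the kernel. For context, here is a minimal host-side sketch of my own (not from the deck) showing how dp_mul could be built and launched over n work-items with the OpenCL 1.x C API; all identifiers are illustrative and error checking is omitted:

#include <CL/cl.h>
#include <stdio.h>

// Kernel source as a string (same dp_mul kernel as on the slide)
static const char *src =
    "kernel void dp_mul(global const float *a, global const float *b, global float *c) {\n"
    "  int id = get_global_id(0);\n"
    "  c[id] = a[id] * b[id];\n"
    "}\n";

int main(void)
{
    enum { n = 1024 };
    float a[n], b[n], c[n];
    for (int i = 0; i < n; i++) { a[i] = (float)i; b[i] = 2.0f; }

    // Pick the first platform/device and create a context and command queue
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    // Build the program and get the kernel
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "dp_mul", NULL);

    // Create device buffers, copying the inputs from host memory
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, NULL);
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, NULL);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, NULL);

    // Bind arguments and execute over n work-items
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);
    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    // Read the result back (blocking) and spot-check one element
    clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);
    printf("c[10] = %f\n", c[10]);
    return 0;
}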
Image Processing Flow at FamilySearch

• Image Capture (Uncompressed TIFF)
   – Microfilm Scanners
   – Digital Cameras
• Image Post-Processing (DPC)
• Preservation Storage (Lossless JPEG-2000)
• Distribution Storage
   – JPEG - original size
   – JPEG - thumbnails
Digital Processing Center (DPC)
• Collection of servers in a data center used by FamilySearch
  to continuously process millions of images annually
• Image post processing operations performed include
   –   Automatic skew correction
   –   Automatic document cropping
   –   Image sharpening
   –   Image scaling (thumbnail creation)
   –   Encoding into other image formats
• The CPU is currently the bottleneck (~12 sec/image)
• Processing requirements continuously rising (number of
  images, image size and number of color channels)
Computer Graphics vs.
          Computer Vision
• Approximate inverses of each other:
   – Computer graphics – converting “numbers into pictures”
   – Computer vision – converting “pictures into numbers”
• GPUs have traditionally been used for computer
  graphics – (Ex. Graphics intensive computer games)
• Recent research, hardware and software are using
  GPUs for computer vision (Ex. Using Graphics
  Devices in Reverse)
• GPUs generally work well when there is ample data-level parallelism
IMPLEMENTATION OPTIONS

Rack Mount Servers
• Several vendors provide solutions (Ex. one is a 3U rack mount unit capable of
  holding 16 GPUs connected to 8 servers)
• "Compared to typical quad-core CPUs, Tesla 20 series computing systems deliver
  equivalent performance at 1/10th the cost and 1/20th the power consumption." (NVIDIA)

Personal Supercomputer
• GPUs for computing can be placed in a standard workstation. Several vendors
  provide solutions.
• Each Tesla GPU requires
   – Available double-wide PCIe slot
   – Two 6-pin or one 8-pin PCIe power connectors and sufficient wattage
   – Recommend 4 GB RAM per card, at least 2.33 GHz quad-core CPU and
     64-bit Linux or Windows
• "250x the computing performance of a standard workstation" (NVIDIA)
Image Processing Performance
                 with IPP and NPP
• FamilySearch currently uses Intel’s IPP
   – Intel Integrated Performance Primitives
   – Optimized operations on Intel CPUs
   – Closed source, licensed


• NVIDIA has produced a similar library called NPP
   –   NVIDIA Performance Primitives
   –   Optimized operations on NVIDIA GPUs (CUDA underneath)
   –   Higher-level abstraction for performing image processing on GPUs
   –   No license fee for the SDK
EXAMPLE – NPP
(IPP code on the CPU vs. the equivalent NPP code on the GPU)

IPP (CPU):

… [Create padded image]
… [Create Gaussian kernel]

// Allocate blurred image of appropriate size
Ipp8u* blurredImg = ippiMalloc_8u_C1(img.getWidth(),
     img.getHeight(), &blurredImgStepSz);

// Perform the filter
ippiFilter32f_8u_C1R(paddedImgData,
     paddedImage.getStepSize(), blurredImg,
     blurredImgStepSz, imgSz, kernel, kernelSize,
     kernelAnchor);


NPP (GPU):

// Declare a host object for an 8-bit grayscale image
npp::ImageCPU_8u_C1 hostSrc;

// Load grayscale image from disk
npp::loadImage(sFilename, hostSrc);

// Declare a device image and upload from host
npp::ImageNPP_8u_C1 deviceSrc(hostSrc);

… [Create padded image]
… [Create Gaussian kernel]

// Copy kernel to GPU
cudaMemcpy2D(deviceKernel, 12, hostKernel, kernelSize.width
     * sizeof(Npp32s), kernelSize.width * sizeof(Npp32s),
     kernelSize.height, cudaMemcpyHostToDevice);

// Allocate blurred image of appropriate size (on GPU)
npp::ImageNPP_8u_C1 deviceBlurredImg(imgSz.width,
     imgSz.height);

// Perform the filter
nppiFilter_8u_C1R(paddedImg.data(widthOffset,
     heightOffset), paddedImg.pitch(),
     deviceBlurredImg.data(), deviceBlurredImg.pitch(),
     imgSz, deviceKernel, kernelSize, kernelAnchor,
     divisor);

// Declare a host image for the result
npp::ImageCPU_8u_C1 hostBlurredImg(deviceBlurredImg.size());

// Copy the device result data into it
deviceBlurredImg.copyTo(hostBlurredImg.data(),
     hostBlurredImg.pitch());
Performance Testing Methodology
• Test System Specifications
    – Dual Quad Core Intel® Xeon® 2.80GHz i7 CPUs (8 cores
      total)
    – 6 GB RAM
    – 64-bit Windows 7 operating system
    – Single Tesla C1060 Compute Processor (240 processing cores
      total)
    – PCI-Express x16 Gen2 slot
• Three representative grayscale images of increasing size
    – Small image – 1726 x 1450 (2.5 megapixels)
    – Average image – 4808 x 3940 (18.9 megapixels)
    – Large image – 8966 x 6132 (55.0 megapixels)
• Results for each image repeated 3 times and averaged
• Transfer time to/from the GPU is considered part of all
  GPU operations
•   Combining operations minimizes GPU/CPU transfers
•   5 – 6x speed up, increasing slightly with image size
AMDAHL’S LAW

Speeding up 25% of an overall process by 10x is less of an
overall improvement than speeding up 75% of an overall
process by 1.5x
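
A quick check of that claim with Amdahl's Law (arithmetic added here, not shown on the slide): if a fraction p of the work is sped up by a factor s, the overall speedup is

   \( S = \frac{1}{(1 - p) + p/s} \)

   Speeding up 25% by 10x:  \( S = 1 / (0.75 + 0.025) \approx 1.29 \)
   Speeding up 75% by 1.5x: \( S = 1 / (0.25 + 0.50) \approx 1.33 \)

so the modest speedup applied to the larger fraction wins.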
Takeaways
• Significant performance increases can be realized through
  parallelization – it may become the only way in the future
• GPUs are transforming into general purpose data-parallel
  computational coprocessors and outstripping advances in multi-
  core CPUs
• Languages, tools and APIs for parallel computing remain relatively
  immature, but are improving rapidly
• Relatively small learning curve
    – For image processing, NPP’s API nearly perfectly matches Intel’s IPP
    – New paradigms around copying to/from GPU and allocating memory
    – Can use programming languages familiar to developers without
      understanding intricacies of GPU architectures
    – Does require rethinking of algorithms to be parallel and building the
      computation around the data

Editor's Notes

  1. Don’t claim to be expert
  2. Source of much of what I will present – gives a lot more details, coming from people who know a lot more than I do
  3. Even CPUs realize performance is about parallelism – multi-core CPUs. Power required increases exponentially with distance – Bill Dally says that lots of arithmetic units are actually not hot
  4. GPUs initially only for computer graphics acceleration
  5. Of course want something that is open
  6. Number of images increasing as is size, more color, etc.
  7. Data center servers for large-scale operations like FamilySearch; workstations could be put in smaller installations such as an archive. Based on a limited survey (most sites don't list prices): ~$5-6K list price for a 1U server or personal supercomputer with 2 Teslas, ~$8-9K list price for a 1U server or personal supercomputer with 4 Teslas, ~$1200 per Tesla
  8. NVIDIA directly going at IPP. Imaging library structured so that we could create an implementation for GPUs to run on a single GPU-based server concurrent with the current system
  9. Rotating, cropping, sharpening and scaling operations parallelized on GPU