General Purpose Computation on
Graphics Processors (GPGPU)
Mike Houston, Stanford University
Mike Houston - Stanford University Graphics Lab
A little about me
http://graphics.stanford.edu/~mhouston
Education:
– UC San Diego, Computer Science BS
– Stanford University, Computer Science MS
– Currently a PhD candidate at Stanford University
Research
– Parallel Rendering
– High performance computing
– Computation on graphics processors (GPGPU)
What can you do on GPUs other than graphics?
Large matrix/vector operations (BLAS)
Protein Folding (Molecular Dynamics)
FFT (SETI, signal processing)
Ray Tracing
Physics Simulation [cloth, fluid, collision]
Sequence Matching (Hidden Markov Models)
Speech Recognition (Hidden Markov Models, Neural nets)
Databases
Sort/Search
Medical Imaging (image segmentation, processing)
And many, many more…
http://www.gpgpu.org
Why use GPUs?
COTS
– In every machine
Performance
– Intel 3.0 GHz Pentium 4
• 12 GFLOPs peak (MAD)
• 5.96 GB/s to main memory
– ATI Radeon X1800XT
• 120 GFLOPs peak (fragment engine)
• 42 GB/s to video memory
Task vs. Data parallelism
Task parallel
– Independent processes with little communication
– Easy to use
• “Free” on modern operating systems with SMP
Data parallel
– Lots of data on which the same computation is being executed
– No dependencies between data elements in each step in the
computation
– Can saturate many ALUs
– But often requires redesign of traditional algorithms
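As a sketch (plain C, illustrative names), a data-parallel loop applies one computation to every element with no dependencies between iterations, so each iteration could run on a separate ALU:

```c
#include <assert.h>

/* Data-parallel kernel: every iteration applies the same computation
 * and reads only its own input element, so the iterations can run in
 * any order -- or all at once across many ALUs. */
void scale_and_bias(const float *in, float *out, float scale, float bias, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = scale * in[i] + bias;  /* no dependency on out[j], j != i */
}
```

By contrast, a loop like `out[i] = out[i-1] + in[i]` carries a dependency between elements and is the kind of traditional algorithm that needs redesign before it parallelizes.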
CPU vs. GPU
CPU
– Really fast caches (great for data reuse)
– Fine branching granularity
– Lots of different processes/threads
– High performance on a single thread of execution
GPU
– Lots of math units
– Fast access to onboard memory
– Run a program on each fragment/vertex
– High throughput on parallel tasks
CPUs are great for task parallelism
GPUs are great for data parallelism
The Importance of Data Parallelism for GPUs
GPUs are designed for highly parallel tasks like rendering
GPUs process independent vertices and fragments
– Temporary registers are zeroed
– No shared or static data
– No read-modify-write buffers
– In short, no communication between vertices or fragments
Data-parallel processing
– GPU architectures are ALU-heavy
• Multiple vertex & pixel pipelines
• Lots of compute power
– GPU memory systems are designed to stream data
• Linear access patterns can be prefetched
• Hide memory latency
Courtesy GPGPU.org
GPGPU Terminology
Arithmetic Intensity
Arithmetic intensity
– Math operations per word transferred
– Computation / bandwidth
Ideal apps to target GPGPU have:
– Large data sets
– High parallelism
– Minimal dependencies between data elements
– High arithmetic intensity
– Lots of work to do without CPU intervention
Courtesy GPGPU.org
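A small worked example (illustrative C, not from the talk) of the ratio defined above:

```c
#include <assert.h>

/* Arithmetic intensity = math operations per word transferred. */
double arithmetic_intensity(double flops, double words)
{
    return flops / words;
}

/* Vector add c[i] = a[i] + b[i]: 1 flop per element against 3 words
 * moved (two reads, one write) -> intensity 1/3, i.e. bandwidth-bound.
 * A blocked matrix multiply reuses each fetched word many times and
 * reaches intensity well above 1, i.e. compute-bound.                 */
```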
Data Streams & Kernels
Streams
– Collection of records requiring similar computation
• Vertex positions, Voxels, FEM cells, etc.
– Provide data parallelism
Kernels
– Functions applied to each element in stream
• transforms, PDE, …
– No dependencies between stream elements
• Encourage high Arithmetic Intensity
Courtesy GPGPU.org
Scatter vs. Gather
Gather
– Indirect read from memory ( x = a[i] )
– Naturally maps to a texture fetch
– Used to access data structures and data streams
Scatter
– Indirect write to memory ( a[i] = x )
– Difficult to emulate:
• Render to vertex array
• Sorting buffer
– Needed for building many data structures
– Usually done on the CPU
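In plain C (illustrative names), the two access patterns look like this:

```c
#include <assert.h>

/* Gather: indirect READ ( x = a[i] ). On the GPU this maps naturally
 * to a (dependent) texture fetch.                                     */
void gather(float *out, const float *a, const int *idx, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[idx[i]];         /* read from a computed address */
}

/* Scatter: indirect WRITE ( a[i] = x ). Hard on a fragment processor
 * because each fragment's output address is fixed to its pixel.       */
void scatter(float *a, const int *idx, const float *x, int n)
{
    for (int i = 0; i < n; i++)
        a[idx[i]] = x[i];           /* write to a computed address */
}
```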
Mapping algorithms to the GPU
Mapping CPU algorithms to the GPU
Basics
– Stream/Arrays -> Textures
– Parallel loops -> Quads
– Loop body -> vertex + fragment program
– Output arrays -> render targets
– Memory read -> texture fetch
– Memory write -> framebuffer write
Controlling the parallel loop
– Rasterization = Kernel Invocation
– Texture Coordinates = Computational Domain
– Vertex Coordinates = Computational Range
Courtesy GPGPU.org
Computational Resources
Programmable parallel processors
– Vertex & Fragment pipelines
Rasterizer
– Mostly useful for interpolating values (texture coordinates) and
per-vertex constants
Texture unit
– Read-only memory interface
Render to texture
– Write-only memory interface
Courtesy GPGPU.org
Vertex Processors
Fully programmable (SIMD / MIMD)
Processes 4-vectors (RGBA / XYZW)
Capable of scatter but not gather
– Can change the location of current vertex
– Cannot read info from other vertices
– Can only read a small constant memory
Vertex Texture Fetch
– Random access memory for vertices
– Limited gather capabilities
• Can fetch from texture
• Cannot fetch from current vertex stream
Courtesy GPGPU.org
Fragment Processors
Fully programmable (SIMD)
Processes 4-component vectors (RGBA / XYZW)
Random access memory read (textures)
Generally capable of gather but not scatter
– Indirect memory read (texture fetch), but no indirect memory write
– Output address fixed to a specific pixel
Typically more useful than vertex processor
– More fragment pipelines than vertex pipelines
– Direct output (fragment processor is at end of pipeline)
– Better memory read performance
For GPGPU, we mainly concentrate on using the fragment
processors
– Most of the flops
– Highest memory bandwidth
Courtesy GPGPU.org
GPGPU example – Adding Vectors
float a[5*5];
float b[5*5];
float c[5*5];
//initialize vector a
//initialize vector b
for(int i=0; i<5*5; i++)
{
c[i] = a[i] + b[i];
}
[5×5 texture: array elements 0–24 laid out row by row]
Place arrays into 2D textures
Convert loop body into a shader
Loop body = Render a quad
– Needs to cover all the pixels in the
output
– 1:1 mapping between pixels and texels
Readback framebuffer into result array
!!ARBfp1.0
TEMP R0;
TEMP R1;
TEX R0, fragment.position, texture[0], RECT; # RECT: unnormalized coords match fragment.position
TEX R1, fragment.position, texture[1], RECT;
ADD R0, R0, R1;
MOV result.color, R0;
END
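What rasterizing the quad effectively computes can be sketched on the CPU (illustrative code, not the real driver path): one kernel invocation per covered pixel, each fetching the matching texels from the two input textures and writing exactly one output pixel:

```c
#include <assert.h>
#define W 5
#define H 5

/* Emulates the fragment program over a 5x5 quad: the rasterizer visits
 * every covered pixel, and the "shader" fetches the corresponding texel
 * from each input texture and writes one pixel of the render target.   */
void render_add_quad(const float texA[H][W], const float texB[H][W],
                     float target[H][W])
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            target[y][x] = texA[y][x] + texB[y][x];  /* 1:1 pixel:texel */
}
```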
How this basically works – Adding vectors
Load Shader -> Set Shader Params -> Bind Input Textures ->
Bind Render Targets -> Render Quad -> Readback Buffer
!!ARBfp1.0
TEMP R0;
TEMP R1;
TEX R0, fragment.position, texture[0], RECT;
TEX R1, fragment.position, texture[1], RECT;
ADD R0, R0, R1;
MOV result.color, R0;
END
[Diagram: input textures Vector A and Vector B are combined by the fragment program into render target Vector C (C = A + B)]
Rolling your own GPGPU apps
Lots of information on GPGPU.org
For those with a strong graphics background:
– Do all the graphics setup yourself
– Write your kernels:
• Use high level languages
– Cg, HLSL, ASHLI
• Or, direct assembly
– ARB_fragment_program, ps20, ps2a, ps2b, ps30
High level languages and systems to make GPGPU
easier
– Brook (http://graphics.stanford.edu/projects/brookgpu/)
– Sh (http://libsh.org)
BrookGPU
History
– Developed at Stanford University
– Goal: allow non-graphics users to use GPUs for computation
– Lots of GPGPU apps written in Brook
Design
– C based language with streaming extensions
– Compiles kernels to DX9 and OpenGL shading models
– Runtimes (DX9/OpenGL) handle all graphics commands
Performance
– 80-90% of hand tuned GPU application in many cases
GPGPU and the ATI X1800
GPGPU on the ATI X1800
IEEE 32-bit floating point
– Simplifies precision issues in applications
Long programs
– We can now handle larger applications
– 512 static instructions
– Effectively unlimited dynamic instructions
Branching and Looping
– No performance cliffs for dynamic branching and looping
– Fine branch granularity: ~16 fragments
Faster upload/download
– 50-100% increase in PCIe bandwidth over last generation
GPGPU on the ATI X1800, cont.
Advanced memory controller
– Latency hiding for streaming reads and writes to memory
• With enough math ops you can hide all memory access!
– Large bandwidth improvement over previous generation
Scatter support (a[i] = x)
– Arbitrary number of float outputs from fragment processors
– Uncached reads and writes for register spilling
F-Buffer
– Support for linearizing datasets
– Store temporaries “in flight”
GPGPU on the ATI X1800, cont.
Flexibility
– Unlimited texture reads
– Unlimited dependent texture reads
– 32 hardware registers per fragment
512MB memory support
– Larger datasets without going to system memory
Performance basics for GPGPU – X1800XT
(from GPUBench)
Compute
– 83 GFLOPs (MAD)
Memory
– 42 GB/s cache bandwidth
– 21 GB/s streaming bandwidth
– 4 cycle latency for a float4 fetch (cache hit)
– 8 cycle latency for a float4 fetch (streaming)
Branch granularity – 16 fragments
Offload to GPU
– Download (GPU -> CPU): 900 MB/s
– Upload (CPU -> GPU): 1.4 GB/s
http://graphics.stanford.edu/projects/gpubench
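A back-of-envelope check of these numbers (illustrative code): with 83 GFLOPs of compute against 21 GB/s of streaming bandwidth, a kernel needs roughly 83e9 / (21e9 / 16) ≈ 63 math ops per float4 fetched before it stops being bandwidth-bound:

```c
#include <assert.h>

/* Flops available per float4 (16 bytes) streamed from memory:
 * the arithmetic intensity needed to keep the ALUs busy.       */
double flops_per_float4(double gflops, double gb_per_sec)
{
    double float4s_per_sec = gb_per_sec * 1e9 / 16.0;  /* fetch rate */
    return gflops * 1e9 / float4s_per_sec;
}
```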
A few examples on the
ATI X1800XT
BLAS
Basic Linear Algebra Subprograms
– High performance computing
• The basis for LINPACK benchmarks
• Heavily used in simulation
– Ubiquitous in many math packages
• MatLab™
• LAPACK
BLAS 1: scalar, vector, vector/vector operations
BLAS 2: matrix-vector operations
BLAS 3: matrix-matrix operations
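Simplified single-precision reference versions of one routine per level (the real BLAS interface also takes strides, transpose flags, etc., omitted here):

```c
#include <assert.h>

/* BLAS 1 (vector-vector):  y = alpha*x + y                  (saxpy) */
void saxpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* BLAS 2 (matrix-vector):  y = A*x, A is m x n, row-major   (sgemv) */
void sgemv(int m, int n, const float *A, const float *x, float *y)
{
    for (int i = 0; i < m; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] = acc;
    }
}
```

BLAS 3 (sgemm) adds a third loop over a second matrix; of the three levels it has the highest arithmetic intensity, which is why it maps best to the GPU.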
BLAS GPU Performance
saxpy (BLAS 1) – single-precision scaled vector addition (y = αx + y)
sgemv (BLAS 2) – single-precision matrix-vector product
sgemm (BLAS 3) – single-precision matrix-matrix product
[Chart: relative performance (scale 0–12) of saxpy, sgemv, and sgemm on a 3.0 GHz P4, 2.5 GHz G5, NVIDIA 7800GTX, and ATI X1800XT]
HMMer – Protein sequence matching
Goal
– Find matching patterns between protein sequences
– Relationship between diseases and genetics
– Genetic relationships between species
Problem
– HUGE databases to search against
– Queries take lots of time to process
• Researchers start searches and go home for the night
Core Algorithm (hmmsearch)
– Viterbi algorithm
– Compare a Hidden Markov Model against a large database of
protein sequences
Paper at IEEE Supercomputing 2005
– http://graphics.stanford.edu/papers/clawhmmer/
HMMer – Performance
[Chart: relative hmmsearch performance on the X1800XT]
Protein Folding
GROMACS provides extremely high
performance compared to all other programs.
Lots of algorithmic optimizations:
– Custom software routines to calculate the inverse square root.
– Inner loops optimized to remove all
conditionals.
– Loops use SSE and 3DNow! multimedia
instructions on x86 processors
– Altivec instructions on PowerPC G4 and later processors
Normally 3-10 times faster than any other
program.
Core algorithm in Folding@Home
http://www.gromacs.org
GROMACS - Performance
[Chart: relative GROMACS performance (scale 0–4), force kernel only vs. complete application, on a 3.0 GHz P4, 2.5 GHz G5, NVIDIA 7800GTX, and ATI X1800XT]
GROMACS – GPU Implementation
Written using Brook by non-graphics programmers
– Offloads force calculation to GPU (~80% of CPU time)
– Force calculation on X1800XT is ~3.5X a 3.0GHz P4
– Overall speed up on X1800XT is ~2.5X a 3.0GHz P4
Not yet optimized for X1800XT
– Using ps2b kernels, i.e. no looping
– Not making use of new scatter functionality
The revenge of Amdahl’s law
– Force calculation is no longer the bottleneck (38% of runtime)
– Need to also accelerate data structure building (neighbor lists)
• MUCH easier with scatter support
This looks like a very promising application for GPUs
– Combine CPU and GPU processing for a folding monster!
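The Amdahl's-law arithmetic behind these figures (a sketch using the approximate fractions above): accelerating ~80% of the runtime by ~3.5x bounds the overall speedup at 1 / (0.2 + 0.8/3.5) ≈ 2.3x, consistent with the ~2.5x measured, and the unaccelerated 20% now dominates:

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the runtime is
 * accelerated by a factor s and the remaining (1-p) is unchanged.    */
double amdahl(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```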
Making GPGPU easier
What GPGPU needs from vendors
More information
– Shader ISA
– Latency information
– GPGPU Programming guide (floating point)
• How to order code for ALU efficiency
• The “real” cost of all instructions
• Expected latencies of different types of memory fetches
Direct access to the hardware
– GL/DX is not what we want to be using
• We don’t need state tracking
• Using graphics commands is odd for doing computation
• The graphics abstractions aren’t useful for us
– Better memory management
Fast transfer to and from GPU
– Non-blocking
Consistent graphics drivers
– Some optimizations for games hurt GPGPU performance
What GPGPU needs from the community
Data Parallel programming languages
– Lots of academic research
“GCC” for GPUs
Parallel data structures
More applications
– What will make the average user care about GPGPU?
– What can we make data parallel and run fast?
Thanks
The BrookGPU team – Ian Buck, Tim Foley, Jeremy
Sugerman, Daniel Horn, Kayvon Fatahalian
GROMACS – Vishal Vaidyanathan, Erich Elsen, Vijay
Pande, Eric Darve
HMMer – Daniel Horn, Eric Lindahl
Pat Hanrahan
Everyone at ATI Technologies
Questions?
I’ll also be around after the talk
Email: mhouston@stanford.edu
Web: http://graphics.stanford.edu/~mhouston
For lots of great GPGPU information:
– GPGPU.org (http://www.gpgpu.org)
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
 

Más de clifford sugerman (20)

Clifford Sugerman
Clifford SugermanClifford Sugerman
Clifford Sugerman
 
Clifford Sugerman
Clifford Sugerman  Clifford Sugerman
Clifford Sugerman
 
Pollution brandon
Pollution brandonPollution brandon
Pollution brandon
 
More teens smoke e cigarettes
More teens smoke e cigarettesMore teens smoke e cigarettes
More teens smoke e cigarettes
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
Clifford sugerman
Clifford  sugermanClifford  sugerman
Clifford sugerman
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
Clifford sugerman
Clifford  sugermanClifford  sugerman
Clifford sugerman
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
Clifford sugerman
Clifford  sugermanClifford  sugerman
Clifford sugerman
 
Clifford sugerman
Clifford  sugermanClifford  sugerman
Clifford sugerman
 
Clifford sugerman
Clifford  sugermanClifford  sugerman
Clifford sugerman
 
Clifford sugerman
Clifford  sugermanClifford  sugerman
Clifford sugerman
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 

Último

AI as Research Assistant: Upscaling Content Analysis to Identify Patterns of ...
AI as Research Assistant: Upscaling Content Analysis to Identify Patterns of ...AI as Research Assistant: Upscaling Content Analysis to Identify Patterns of ...
AI as Research Assistant: Upscaling Content Analysis to Identify Patterns of ...Axel Bruns
 
2024 02 15 AZ GOP LD4 Gen Meeting Minutes_FINAL_20240228.docx
2024 02 15 AZ GOP LD4 Gen Meeting Minutes_FINAL_20240228.docx2024 02 15 AZ GOP LD4 Gen Meeting Minutes_FINAL_20240228.docx
2024 02 15 AZ GOP LD4 Gen Meeting Minutes_FINAL_20240228.docxkfjstone13
 
Gujarat-SEBCs.pdf pfpkoopapriorjfperjreie
Gujarat-SEBCs.pdf pfpkoopapriorjfperjreieGujarat-SEBCs.pdf pfpkoopapriorjfperjreie
Gujarat-SEBCs.pdf pfpkoopapriorjfperjreiebhavenpr
 
如何办理(BU学位证书)美国贝翰文大学毕业证学位证书
如何办理(BU学位证书)美国贝翰文大学毕业证学位证书如何办理(BU学位证书)美国贝翰文大学毕业证学位证书
如何办理(BU学位证书)美国贝翰文大学毕业证学位证书Fi L
 
BDSM⚡Call Girls in Greater Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Greater Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Greater Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Greater Noida Escorts >༒8448380779 Escort ServiceDelhi Call girls
 
30042024_First India Newspaper Jaipur.pdf
30042024_First India Newspaper Jaipur.pdf30042024_First India Newspaper Jaipur.pdf
30042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
28042024_First India Newspaper Jaipur.pdf
28042024_First India Newspaper Jaipur.pdf28042024_First India Newspaper Jaipur.pdf
28042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
29042024_First India Newspaper Jaipur.pdf
29042024_First India Newspaper Jaipur.pdf29042024_First India Newspaper Jaipur.pdf
29042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
₹5.5k {Cash Payment} Independent Greater Noida Call Girls In [Delhi INAYA] 🔝|...
₹5.5k {Cash Payment} Independent Greater Noida Call Girls In [Delhi INAYA] 🔝|...₹5.5k {Cash Payment} Independent Greater Noida Call Girls In [Delhi INAYA] 🔝|...
₹5.5k {Cash Payment} Independent Greater Noida Call Girls In [Delhi INAYA] 🔝|...Diya Sharma
 
BDSM⚡Call Girls in Sector 143 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 143 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 143 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 143 Noida Escorts >༒8448380779 Escort ServiceDelhi Call girls
 
KAHULUGAN AT KAHALAGAHAN NG GAWAING PANSIBIKO.pptx
KAHULUGAN AT KAHALAGAHAN NG GAWAING PANSIBIKO.pptxKAHULUGAN AT KAHALAGAHAN NG GAWAING PANSIBIKO.pptx
KAHULUGAN AT KAHALAGAHAN NG GAWAING PANSIBIKO.pptxjohnandrewcarlos
 
How Europe Underdeveloped Africa_walter.pdf
How Europe Underdeveloped Africa_walter.pdfHow Europe Underdeveloped Africa_walter.pdf
How Europe Underdeveloped Africa_walter.pdfLorenzo Lemes
 
Powerful Love Spells in Phoenix, AZ (310) 882-6330 Bring Back Lost Lover
Powerful Love Spells in Phoenix, AZ (310) 882-6330 Bring Back Lost LoverPowerful Love Spells in Phoenix, AZ (310) 882-6330 Bring Back Lost Lover
Powerful Love Spells in Phoenix, AZ (310) 882-6330 Bring Back Lost LoverPsychicRuben LoveSpells
 
Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...
Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...
Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...narsireddynannuri1
 
WhatsApp 📞 8448380779 ✅Call Girls In Chaura Sector 22 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Chaura Sector 22 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Chaura Sector 22 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Chaura Sector 22 ( Noida)Delhi Call girls
 
Embed-2 (1).pdfb[k[k[[k[kkkpkdpokkdpkopko
Embed-2 (1).pdfb[k[k[[k[kkkpkdpokkdpkopkoEmbed-2 (1).pdfb[k[k[[k[kkkpkdpokkdpkopko
Embed-2 (1).pdfb[k[k[[k[kkkpkdpokkdpkopkobhavenpr
 
Israel Palestine Conflict, The issue and historical context!
Israel Palestine Conflict, The issue and historical context!Israel Palestine Conflict, The issue and historical context!
Israel Palestine Conflict, The issue and historical context!Krish109503
 
Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...
Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...
Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...AlexisTorres963861
 
Julius Randle's Injury Status: Surgery Not Off the Table
Julius Randle's Injury Status: Surgery Not Off the TableJulius Randle's Injury Status: Surgery Not Off the Table
Julius Randle's Injury Status: Surgery Not Off the Tableget joys
 
Nara Chandrababu Naidu's Visionary Policies For Andhra Pradesh's Development
Nara Chandrababu Naidu's Visionary Policies For Andhra Pradesh's DevelopmentNara Chandrababu Naidu's Visionary Policies For Andhra Pradesh's Development
Nara Chandrababu Naidu's Visionary Policies For Andhra Pradesh's Developmentnarsireddynannuri1
 

Último (20)

AI as Research Assistant: Upscaling Content Analysis to Identify Patterns of ...
AI as Research Assistant: Upscaling Content Analysis to Identify Patterns of ...AI as Research Assistant: Upscaling Content Analysis to Identify Patterns of ...
AI as Research Assistant: Upscaling Content Analysis to Identify Patterns of ...
 
2024 02 15 AZ GOP LD4 Gen Meeting Minutes_FINAL_20240228.docx
2024 02 15 AZ GOP LD4 Gen Meeting Minutes_FINAL_20240228.docx2024 02 15 AZ GOP LD4 Gen Meeting Minutes_FINAL_20240228.docx
2024 02 15 AZ GOP LD4 Gen Meeting Minutes_FINAL_20240228.docx
 
Gujarat-SEBCs.pdf pfpkoopapriorjfperjreie
Gujarat-SEBCs.pdf pfpkoopapriorjfperjreieGujarat-SEBCs.pdf pfpkoopapriorjfperjreie
Gujarat-SEBCs.pdf pfpkoopapriorjfperjreie
 
如何办理(BU学位证书)美国贝翰文大学毕业证学位证书
如何办理(BU学位证书)美国贝翰文大学毕业证学位证书如何办理(BU学位证书)美国贝翰文大学毕业证学位证书
如何办理(BU学位证书)美国贝翰文大学毕业证学位证书
 
BDSM⚡Call Girls in Greater Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Greater Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Greater Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Greater Noida Escorts >༒8448380779 Escort Service
 
30042024_First India Newspaper Jaipur.pdf
30042024_First India Newspaper Jaipur.pdf30042024_First India Newspaper Jaipur.pdf
30042024_First India Newspaper Jaipur.pdf
 
28042024_First India Newspaper Jaipur.pdf
28042024_First India Newspaper Jaipur.pdf28042024_First India Newspaper Jaipur.pdf
28042024_First India Newspaper Jaipur.pdf
 
29042024_First India Newspaper Jaipur.pdf
29042024_First India Newspaper Jaipur.pdf29042024_First India Newspaper Jaipur.pdf
29042024_First India Newspaper Jaipur.pdf
 
₹5.5k {Cash Payment} Independent Greater Noida Call Girls In [Delhi INAYA] 🔝|...
₹5.5k {Cash Payment} Independent Greater Noida Call Girls In [Delhi INAYA] 🔝|...₹5.5k {Cash Payment} Independent Greater Noida Call Girls In [Delhi INAYA] 🔝|...
₹5.5k {Cash Payment} Independent Greater Noida Call Girls In [Delhi INAYA] 🔝|...
 
BDSM⚡Call Girls in Sector 143 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 143 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 143 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 143 Noida Escorts >༒8448380779 Escort Service
 
KAHULUGAN AT KAHALAGAHAN NG GAWAING PANSIBIKO.pptx
KAHULUGAN AT KAHALAGAHAN NG GAWAING PANSIBIKO.pptxKAHULUGAN AT KAHALAGAHAN NG GAWAING PANSIBIKO.pptx
KAHULUGAN AT KAHALAGAHAN NG GAWAING PANSIBIKO.pptx
 
How Europe Underdeveloped Africa_walter.pdf
How Europe Underdeveloped Africa_walter.pdfHow Europe Underdeveloped Africa_walter.pdf
How Europe Underdeveloped Africa_walter.pdf
 
Powerful Love Spells in Phoenix, AZ (310) 882-6330 Bring Back Lost Lover
Powerful Love Spells in Phoenix, AZ (310) 882-6330 Bring Back Lost LoverPowerful Love Spells in Phoenix, AZ (310) 882-6330 Bring Back Lost Lover
Powerful Love Spells in Phoenix, AZ (310) 882-6330 Bring Back Lost Lover
 
Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...
Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...
Nurturing Families, Empowering Lives: TDP's Vision for Family Welfare in Andh...
 
WhatsApp 📞 8448380779 ✅Call Girls In Chaura Sector 22 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Chaura Sector 22 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Chaura Sector 22 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Chaura Sector 22 ( Noida)
 
Embed-2 (1).pdfb[k[k[[k[kkkpkdpokkdpkopko
Embed-2 (1).pdfb[k[k[[k[kkkpkdpokkdpkopkoEmbed-2 (1).pdfb[k[k[[k[kkkpkdpokkdpkopko
Embed-2 (1).pdfb[k[k[[k[kkkpkdpokkdpkopko
 
Israel Palestine Conflict, The issue and historical context!
Israel Palestine Conflict, The issue and historical context!Israel Palestine Conflict, The issue and historical context!
Israel Palestine Conflict, The issue and historical context!
 
Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...
Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...
Defensa de JOH insiste que testimonio de analista de la DEA es falso y solici...
 
Julius Randle's Injury Status: Surgery Not Off the Table
Julius Randle's Injury Status: Surgery Not Off the TableJulius Randle's Injury Status: Surgery Not Off the Table
Julius Randle's Injury Status: Surgery Not Off the Table
 
Nara Chandrababu Naidu's Visionary Policies For Andhra Pradesh's Development
Nara Chandrababu Naidu's Visionary Policies For Andhra Pradesh's DevelopmentNara Chandrababu Naidu's Visionary Policies For Andhra Pradesh's Development
Nara Chandrababu Naidu's Visionary Policies For Andhra Pradesh's Development
 

Cliff sugerman

  • 1. General Purpose Computation on Graphics Processors (GPGPU)
    Mike Houston, Stanford University
  • 2. A little about me
    http://graphics.stanford.edu/~mhouston
    Education:
    – UC San Diego, Computer Science BS
    – Stanford University, Computer Science MS
    – Currently a PhD candidate at Stanford University
    Research:
    – Parallel Rendering
    – High performance computing
    – Computation on graphics processors (GPGPU)
  • 3. What can you do on GPUs other than graphics?
    Large matrix/vector operations (BLAS)
    Protein Folding (Molecular Dynamics)
    FFT (SETI, signal processing)
    Ray Tracing
    Physics Simulation (cloth, fluid, collision)
    Sequence Matching (Hidden Markov Models)
    Speech Recognition (Hidden Markov Models, Neural nets)
    Databases
    Sort/Search
    Medical Imaging (image segmentation, processing)
    And many, many more…
    http://www.gpgpu.org
  • 4. Why use GPUs?
    COTS (commodity off-the-shelf)
    – In every machine
    Performance
    – Intel 3.0 GHz Pentium 4
    • 12 GFLOPs peak (MAD)
    • 5.96 GB/s to main memory
    – ATI Radeon X1800XT
    • 120 GFLOPs peak (fragment engine)
    • 42 GB/s to video memory
  • 5. Task vs. Data parallelism
    Task parallel
    – Independent processes with little communication
    – Easy to use
    • “Free” on modern operating systems with SMP
    Data parallel
    – Lots of data on which the same computation is being executed
    – No dependencies between data elements in each step of the computation
    – Can saturate many ALUs
    – But often requires redesign of traditional algorithms
  • 6. CPU vs. GPU
    CPU
    – Really fast caches (great for data reuse)
    – Fine branching granularity
    – Lots of different processes/threads
    – High performance on a single thread of execution
    GPU
    – Lots of math units
    – Fast access to onboard memory
    – Runs a program on each fragment/vertex
    – High throughput on parallel tasks
    CPUs are great for task parallelism; GPUs are great for data parallelism
  • 7. The Importance of Data Parallelism for GPUs
    GPUs are designed for highly parallel tasks like rendering
    GPUs process independent vertices and fragments
    – Temporary registers are zeroed
    – No shared or static data
    – No read-modify-write buffers
    – In short, no communication between vertices or fragments
    Data-parallel processing
    – GPU architectures are ALU-heavy
    • Multiple vertex & pixel pipelines
    • Lots of compute power
    – GPU memory systems are designed to stream data
    • Linear access patterns can be prefetched
    • Hide memory latency
    Courtesy GPGPU.org
  • 8. GPGPU Terminology
  • 9. Arithmetic Intensity
    Arithmetic intensity
    – Math operations per word transferred
    – Computation / bandwidth
    Ideal apps to target for GPGPU have:
    – Large data sets
    – High parallelism
    – Minimal dependencies between data elements
    – High arithmetic intensity
    – Lots of work to do without CPU intervention
    Courtesy GPGPU.org
  • 10. Data Streams & Kernels
    Streams
    – Collection of records requiring similar computation
    • Vertex positions, voxels, FEM cells, etc.
    – Provide data parallelism
    Kernels
    – Functions applied to each element in a stream
    • Transforms, PDE solvers, …
    – No dependencies between stream elements
    • Encourages high arithmetic intensity
    Courtesy GPGPU.org
  • 11. Scatter vs. Gather
    Gather
    – Indirect read from memory ( x = a[i] )
    – Naturally maps to a texture fetch
    – Used to access data structures and data streams
    Scatter
    – Indirect write to memory ( a[i] = x )
    – Difficult to emulate:
    • Render to vertex array
    • Sorting buffer
    – Needed for building many data structures
    – Usually done on the CPU
  • 12. Mapping algorithms to the GPU
  • 13. Mapping CPU algorithms to the GPU
    Basics
    – Streams/arrays -> textures
    – Parallel loops -> quads
    – Loop body -> vertex + fragment program
    – Output arrays -> render targets
    – Memory read -> texture fetch
    – Memory write -> framebuffer write
    Controlling the parallel loop
    – Rasterization = kernel invocation
    – Texture coordinates = computational domain
    – Vertex coordinates = computational range
    Courtesy GPGPU.org
  • 14. Computational Resources
    Programmable parallel processors
    – Vertex & fragment pipelines
    Rasterizer
    – Mostly useful for interpolating values (texture coordinates) and per-vertex constants
    Texture unit
    – Read-only memory interface
    Render to texture
    – Write-only memory interface
    Courtesy GPGPU.org
  • 15. Vertex Processors
    Fully programmable (SIMD / MIMD)
    Process 4-vectors (RGBA / XYZW)
    Capable of scatter but not gather
    – Can change the location of the current vertex
    – Cannot read info from other vertices
    – Can only read a small constant memory
    Vertex texture fetch
    – Random access memory for vertices
    – Limited gather capabilities
    • Can fetch from texture
    • Cannot fetch from current vertex stream
    Courtesy GPGPU.org
  • 16. Fragment Processors
    Fully programmable (SIMD)
    Process 4-component vectors (RGBA / XYZW)
    Random access memory reads (textures)
    Generally capable of gather but not scatter
    – Indirect memory read (texture fetch), but no indirect memory write
    – Output address fixed to a specific pixel
    Typically more useful than vertex processors
    – More fragment pipelines than vertex pipelines
    – Direct output (fragment processor is at the end of the pipeline)
    – Better memory read performance
    For GPGPU, we mainly concentrate on the fragment processors
    – Most of the flops
    – Highest memory bandwidth
    Courtesy GPGPU.org
  • 17. GPGPU example – Adding Vectors
    CPU version:
    float a[5*5]; float b[5*5]; float c[5*5];
    // initialize vectors a and b
    for (int i = 0; i < 5*5; i++) { c[i] = a[i] + b[i]; }
    (Figure: the 25 elements laid out as a 5×5 texture, indices 0–24.)
    GPU version:
    – Place arrays into 2D textures
    – Convert loop body into a shader
    – Loop body = render a quad
    • Needs to cover all the pixels in the output
    • 1:1 mapping between pixels and texels
    – Read back framebuffer into result array
    !!ARBfp1.0
    TEMP R0;
    TEMP R1;
    TEX R0, fragment.position, texture[0], 2D;
    TEX R1, fragment.position, texture[1], 2D;
    ADD R0, R0, R1;
    MOV fragment.color, R0;
    END
  • 18. How this basically works – Adding vectors
    – Bind input textures (Vector A, Vector B)
    – Bind render target (Vector C)
    – Load shader
    – Set shader params
    – Render quad
    – Read back buffer (C = A + B)
    (Shader, as before:)
    !!ARBfp1.0
    TEMP R0;
    TEMP R1;
    TEX R0, fragment.position, texture[0], 2D;
    TEX R1, fragment.position, texture[1], 2D;
    ADD R0, R0, R1;
    MOV fragment.color, R0;
  • 19. Rolling your own GPGPU apps
    Lots of information on GPGPU.org
    For those with a strong graphics background:
    – Do all the graphics setup yourself
    – Write your kernels:
    • Use high-level languages: Cg, HLSL, ASHLI
    • Or direct assembly: ARB_fragment_program, ps20, ps2a, ps2b, ps30
    High-level languages and systems to make GPGPU easier:
    – Brook (http://graphics.stanford.edu/projects/brookgpu/)
    – Sh (http://libsh.org)
  • 20. BrookGPU
    History
    – Developed at Stanford University
    – Goal: allow non-graphics users to use GPUs for computation
    – Lots of GPGPU apps written in Brook
    Design
    – C-based language with streaming extensions
    – Compiles kernels to DX9 and OpenGL shading models
    – Runtimes (DX9/OpenGL) handle all graphics commands
    Performance
    – 80-90% of hand-tuned GPU application performance in many cases
  • 21. GPGPU and the ATI X1800
  • 22. GPGPU on the ATI X1800
    IEEE 32-bit floating point
    – Simplifies precision issues in applications
    Long programs
    – We can now handle larger applications
    – 512 static instructions
    – Effectively unlimited dynamic instructions
    Branching and looping
    – No performance cliffs for dynamic branching and looping
    – Fine branch granularity: ~16 fragments
    Faster upload/download
    – 50-100% increase in PCIe bandwidth over the last generation
  • 23. GPGPU on the ATI X1800, cont.
    Advanced memory controller
    – Latency hiding for streaming reads and writes to memory
    • With enough math ops you can hide all memory access!
    – Large bandwidth improvement over the previous generation
    Scatter support ( a[i] = x )
    – Arbitrary number of float outputs from fragment processors
    – Uncached reads and writes for register spilling
    F-Buffer
    – Support for linearizing datasets
    – Stores temporaries “in flight”
  • 24. GPGPU on the ATI X1800, cont.
    Flexibility
    – Unlimited texture reads
    – Unlimited dependent texture reads
    – 32 hardware registers per fragment
    512MB memory support
    – Larger datasets without going to system memory
  • 25. Performance basics for GPGPU – X1800XT (from GPUBench)
    Compute
    – 83 GFLOPs (MAD)
    Memory
    – 42 GB/s cache bandwidth
    – 21 GB/s streaming bandwidth
    – 4 cycle latency for a float4 fetch (cache hit)
    – 8 cycle latency for a float4 fetch (streaming)
    Branch granularity
    – 16 fragments
    Offload to GPU
    – Download (GPU -> CPU): 900 MB/s
    – Upload (CPU -> GPU): 1.4 GB/s
    http://graphics.stanford.edu/projects/gpubench
  • 26. A few examples on the ATI X1800XT
  • 27. BLAS
    Basic Linear Algebra Subprograms
    – High performance computing
    • The basis for LINPACK benchmarks
    • Heavily used in simulation
    – Ubiquitous in many math packages
    • MATLAB™
    • LAPACK
    BLAS 1: scalar, vector, and vector/vector operations
    BLAS 2: matrix-vector operations
    BLAS 3: matrix-matrix operations
  • 28. BLAS GPU Performance
    saxpy (BLAS 1) – single precision vector-scalar product
    sgemv (BLAS 2) – single precision matrix-vector product
    sgemm (BLAS 3) – single precision matrix-matrix product
    (Chart: relative performance of saxpy, sgemv, and sgemm on a 3.0 GHz P4, a 2.5 GHz G5, an NVIDIA 7800GTX, and an ATI X1800XT.)
  • 29. HMMer – Protein sequence matching
    Goal
    – Find matching patterns between protein sequences
    – Relationship between diseases and genetics
    – Genetic relationships between species
    Problem
    – HUGE databases to search against
    – Queries take lots of time to process
    • Researchers start searches and go home for the night
    Core algorithm (hmmsearch)
    – Viterbi algorithm
    – Compare a Hidden Markov Model against a large database of protein sequences
    Paper at IEEE Supercomputing 2005
    – http://graphics.stanford.edu/papers/clawhmmer/
  • 30. HMMer – Performance
    (Chart: relative hmmsearch performance, X1800XT vs. other platforms.)
  • 31. Protein Folding
    GROMACS provides extremely high performance compared to all other programs.
    Lots of algorithmic optimizations:
    – Own software routines to calculate the inverse square root
    – Inner loops optimized to remove all conditionals
    – Loops use SSE and 3DNow! multimedia instructions for x86 processors
    – For PowerPC G4 and later processors: Altivec instructions provided
    Normally 3-10 times faster than any other program.
    Core algorithm in Folding@Home
    http://www.gromacs.org
  • 32. GROMACS – Performance
    (Chart: relative performance on a 3.0 GHz P4, 2.5 GHz G5, NVIDIA 7800GTX, and ATI X1800XT, for the force kernel only and for the complete application.)
  • 33. GROMACS – GPU Implementation
    Written using Brook by non-graphics programmers
    – Offloads force calculation to the GPU (~80% of CPU time)
    – Force calculation on the X1800XT is ~3.5x a 3.0 GHz P4
    – Overall speedup on the X1800XT is ~2.5x a 3.0 GHz P4
    Not yet optimized for the X1800XT
    – Using ps2b kernels, i.e. no looping
    – Not making use of the new scatter functionality
    The revenge of Amdahl’s law
    – Force calculation is no longer the bottleneck (38% of runtime)
    – Need to also accelerate data structure building (neighbor lists)
    • MUCH easier with scatter support
    This looks like a very promising application for GPUs
    – Combine CPU and GPU processing for a folding monster!
  • 34. Making GPGPU easier
  • 35. What GPGPU needs from vendors
    More information
    – Shader ISA
    – Latency information
    – GPGPU programming guide (floating point)
    • How to order code for ALU efficiency
    • The “real” cost of all instructions
    • Expected latencies of different types of memory fetches
    Direct access to the hardware
    – GL/DX is not what we want to be using
    • We don’t need state tracking
    • Using graphics commands is odd for doing computation
    • The graphics abstractions aren’t useful for us
    – Better memory management
    Fast transfer to and from the GPU
    – Non-blocking
    Consistent graphics drivers
    – Some optimizations for games hurt GPGPU performance
  • 36. What GPGPU needs from the community
    Data-parallel programming languages
    – Lots of academic research
    A “GCC” for GPUs
    Parallel data structures
    More applications
    – What will make the average user care about GPGPU?
    – What can we make data parallel and run fast?
  • 37. Thanks
    The BrookGPU team
    – Ian Buck, Tim Foley, Jeremy Sugerman, Daniel Horn, Kayvon Fatahalian
    GROMACS
    – Vishal Vaidyanathan, Erich Elsen, Vijay Pande, Eric Darve
    HMMer
    – Daniel Horn, Eric Lindahl
    Pat Hanrahan
    Everyone at ATI Technologies
  • 38. Questions?
    I’ll also be around after the talk
    Email: mhouston@stanford.edu
    Web: http://graphics.stanford.edu/~mhouston
    For lots of great GPGPU information:
    – GPGPU.org (http://www.gpgpu.org)